ABSTRACT

A multiprocessor system includes a number of central processing units
(CPUs) and at least one input/output (I/O) device interconnected by
routing apparatus for communicating packetized messages therebetween.
The messages contain address information identifying the source and
destination of the message, and may also contain requests to write to,
or read from, storage of a CPU. Protection against errant reads or
writes is provided by an access validation method that utilizes access
validation information contained in plural entries maintained by each
CPU. Each entry provides validation by identifying what elements of the
system have read and/or write access to the memory of that CPU, without
which memory access is denied.

12 Claims, 30 Drawing Sheets

Title - TI (1):

Storage access validation to data messages using partial storage address
data indexed entries containing permissible address range validation for
message source

Brief Summary Text - BSTX (24): 

As indicated above, the processing system of the present invention is
structured to provide fault-tolerant operation through both "fail-fast"
and "fail-functional" operation. Fail-fast operation is achieved by
locating error-checking capability at strategic points of the system.
For example, each CPU has error-checking capability at a variety of
points in the various data paths between the (lock-step operated)
processor elements of the CPU and its associated memory. In particular,
the processing system of the present invention conducts error-checking
at an interface, and in a manner, that makes little impact on
performance. Prior art systems typically implement error-checking by
running pairs of processors, and checking (comparing) the data and
instruction flow between the processors and a cache memory. This
technique of error-checking tended to add delay to the accesses. Also,
this type of error-checking precluded use of off-the-shelf parts that
may be available (i.e., processor/cache memory combinations on a single
semiconductor chip or module). The present invention performs
error-checking of the processors at points that operate at slower rates,
such as the main memory and I/O interfaces which operate at slower
speeds than the processor-cache interface. In addition, the
error-checking is performed at locations that allow detection of errors
that may occur in the processors, their cache memory, and the I/O and
memory interfaces. This allows simpler designs for the memory and I/O
interfaces as they do not require parity or other data integrity checks.

Drawing Description Text - DRTX (36):

FIG. 29 illustrates a portion of system memory, showing cache block 
boundaries; and 

Detailed Description Text - DETX (23): 

Turning now to FIG. 2, the CPU 12A is illustrated in greater detail.
Since both CPUs 12A and 12B are substantially identical in structure and
function, only the details of the CPU 12A will be described. However, it
will be understood that, unless otherwise noted, the discussion of CPU
12A will apply equally to CPU 12B. As FIG. 2 shows, the CPU 12A includes
a pair of processor units 20a, 20b that are configured for synchronized,
lock-step operation in that both processor units 20a, 20b receive and
execute identical instructions, and issue identical data and command
outputs, at substantially the same moments in time. Each of the
processor units 20a and 20b is connected, by a bus 21 (21a, 21b) to a
corresponding cache memory 22. The particular type of processor units
used could contain sufficient internal cache memory so that the cache
memory 22 would not be needed. Alternatively, cache memory 22 could be
used to supplement any cache memory that may be internal to the
processor units 20. In any event, if the cache memory 22 is used, the
bus 21 is structured to conduct 128 bits of data, 16 bits of
error-correcting code (ECC) check bits protecting the data, 25 tag bits
(for the data and corresponding ECC), 3 check bits covering the tag
bits, 22 address bits, 3 bits of parity covering the address, and 7
control bits.

Detailed Description Text - DETX (25): 

The X and Y interface units 24a, 24b operate to communicate data and
command signals between the processor units 20a, 20b and a memory system
of the CPU 12A, comprising a memory controller (MC) 26 (composed of two
MC halves 26a and 26b) and a dynamic random access memory array 28. The
interface units 24 interconnect to each other and to the MCs 26a, 26b by
a 72-bit address/command bus 25. However, as will be seen, although
64-bit doublewords of data (accompanied by 8 bits of ECC) are written to
the memory 28 by the interface units 24, one interface unit 24 will
drive only one word (e.g., the most significant 32-bit portion) of the
doubleword being written while the other interface unit 24 writes the
other word of the doubleword (e.g., the least significant 32-bit portion
of the doubleword). In addition, on each write operation the interface
units 24a, 24b perform a cross-check operation on the data not written
by that interface unit 24 with the data written by the other to check
for errors; on read operations the addresses put on the bus 25 are also
cross-checked in the same manner. The particular ECC used for protecting
both the data written to the cache memory 22 as well as the (main)
memory 28 is conventional, and provides single-bit error correction,
double-bit error detection.
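
The split write and cross-check described above can be sketched as
follows. This is a minimal model, assuming each interface unit holds its
own copy of the full 64-bit doubleword (as lock-step operation implies).
Which unit drives the high word, and the names split, write_doubleword,
x_value, and y_value, are illustrative assumptions; ECC generation is
omitted.

```python
MASK32 = 0xFFFFFFFF


def split(doubleword):
    """Return the (most significant, least significant) 32-bit words."""
    return (doubleword >> 32) & MASK32, doubleword & MASK32


def write_doubleword(x_value, y_value):
    """Model one write: unit X drives the high word, unit Y the low word.

    Each unit compares the word driven by the other unit with its own
    copy of that word. Returns (value placed on the bus, check passed).
    The X-high / Y-low assignment is an assumption for illustration.
    """
    x_hi, x_lo = split(x_value)
    y_hi, y_lo = split(y_value)
    bus = (x_hi << 32) | y_lo
    x_check = x_lo == y_lo  # X checks the low word driven by Y
    y_check = y_hi == x_hi  # Y checks the high word driven by X
    return bus, x_check and y_check
```

When both units agree, the bus carries the common doubleword and the
check passes; any divergence between the two copies is flagged even
though each unit drives only half of the data.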

Detailed Description Text - DETX (137): 

C: (Cache Coherency) This is a two-bit field, encoded to specify how
write requests to the memory 28 will be handled. Set to one state, the
requested write operation will be processed normally; set to a second
state, write requests specifying addresses with a fractional cache line
included at the upper or lower bound of the AVT entry mapped area of
memory are written to the cache coherency queue maintained by an
interrupt handler 250 (FIG. 14A), described below. This allows the CPU
12 to manage write transfers into a user data structure or buffer area
in the memory 28 which does not have full cache line alignment. Set to a
third state, all write requests accessing this AVT entry are written to
the cache coherency queue. Set to the fourth state, the physical memory
locations referenced by this AVT entry are accessed using hardware
coherency mechanisms.
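
The four settings of the coherency field amount to a small routing
decision, sketched below. The numeric encodings, the names, the 64-byte
default line size, and the reading of "fractional cache line at the
upper or lower bound" as a write that touches a cache line lying only
partly inside the AVT-mapped area are all assumptions made for
illustration; the text does not give the actual encodings.

```python
from enum import Enum


class Coherency(Enum):
    # Hypothetical encodings; the text only names first..fourth states.
    NORMAL = 0
    QUEUE_FRACTIONAL = 1
    QUEUE_ALL = 2
    HARDWARE = 3


def touches_fractional_line(start, length, lwr, upr, line):
    """True if the write [start, start+length) touches a cache line that
    lies only partly inside the AVT-mapped area [lwr, upr).

    Assumes the write has already been validated to fall inside the area.
    """
    end = start + length
    first_full = -(-lwr // line) * line  # lwr rounded up to a line boundary
    last_full = (upr // line) * line     # upr rounded down to a line boundary
    return start < first_full or end > last_full


def route_write(c, start, length, lwr, upr, line=64):
    """Return where the write goes under coherency setting c."""
    if c is Coherency.NORMAL:
        return "memory"
    if c is Coherency.QUEUE_ALL:
        return "coherency_queue"
    if c is Coherency.HARDWARE:
        return "hardware_coherent"
    # QUEUE_FRACTIONAL: only writes touching a partial line are queued.
    if touches_fractional_line(start, length, lwr, upr, line):
        return "coherency_queue"
    return "memory"
```

With an area of [100, 1000) and 64-byte lines, a write at address 100
touches the partial line at the lower bound and is queued under the
second setting, while a write covering 128 through 191 goes straight to
memory.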

Detailed Description Text - DETX (175): 

Preferably, the AVT entry register 180 is configured to operate like a
single line cache, complete with a TAG and valid bit. The TAG would
consist of the portion of the TNet address used to look up the AVT entry
from the system memory 28. In normal operation, if the TAG does not
match the TNet address of an incoming packet, the correct AVT entry is
read from system memory 28 and read into the AVT entry register 206,
replacing the old AVT entry. Those skilled in this art will recognize
that other cache organizations are possible such as set-associative,
fully-associative, or direct-mapped, to name a few.
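
The single-line cache behavior described above can be sketched as
follows. This is an illustrative model only: the class name, the use of
a dictionary to stand in for the AVT entries held in system memory, and
the fetch counter are assumptions, and the index passed to lookup stands
for the portion of the TNet address used as the TAG.

```python
class AvtEntryRegister:
    """A one-entry cache holding the most recently used AVT entry."""

    def __init__(self, avt_table):
        self.avt_table = avt_table  # stand-in for AVT entries in memory
        self.tag = None
        self.valid = False
        self.entry = None
        self.fetches = 0            # counts reads from "system memory"

    def lookup(self, index):
        """Return the AVT entry for index, refilling on a TAG mismatch."""
        if not (self.valid and self.tag == index):
            self.entry = self.avt_table[index]  # replaces the old entry
            self.tag = index
            self.valid = True
            self.fetches += 1
        return self.entry
```

Consecutive packets that map to the same AVT entry hit the register and
avoid a memory read; a packet with a different TAG portion forces one
refill.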

Detailed Description Text - DETX (176): 
Coherency:

Detailed Description Text - DETX (177): 

Data processing systems that use cache memory have long recognized the
problem of coherency: making sure that an access to cache or main memory
never returns stale data, or overwrites good (up-to-date) data. There
are numerous solutions to this problem, many of which make use of
extensive and complex hardware. The coherency problem also arises when
data is written to memory from external (to the CPU) I/O or another CPU
12, as when, in the context of the system 10 (e.g., FIG. 2), data is
written to the memory 28 of the CPU 12A by the CPU 12B. One solution is
to ensure that incoming data is written to memory buffers such that the
bounds of the buffer are aligned with cache block boundaries. This
solution, however, finds application only when used with software
schemes to invalidate cache blocks used for incoming data, and forcing
write-back of cache blocks used for out-going data.

Detailed Description Text - DETX (178): 

Thus, there exist traditional techniques for software management of
coherency problems suitable for incoming read requests (from I/O, or
another CPU 12), and outgoing read and write requests. However, the
traditional techniques do not lend themselves to managing incoming write
requests to an I/O buffer in memory 28 that is not aligned on cache
block boundaries.

Detailed Description Text - DETX (179): 

However, requiring alignment of the I/O buffers in memory on cache block
boundaries results in a less flexible system, and a system that can be
incompatible with existing (operating system) software. Therefore, the
interrupt mechanism of the present invention is used to establish
coherency in a manner that allows data buffers to be located in memory
without concern as to whether or not the boundary of that buffer is
aligned with the cache block boundaries.

Detailed Description Text - DETX (180): 

In this connection, the fields in the AVT table Entry register 180 (FIG.
11) defining the upper and lower boundaries (upr bnd, lwr bnd) of the
area of memory 28 to which the source of the incoming packet is
permitted access are applied to a boundary crossing (Bdry Xing) check
unit 219. Boundary check unit 219 also receives an indication of the
size of the cache block the CPU 12 is configured to operate with, the
coherency bits ("c[1:0]") from the Permissions field of the AVT entry
held in the AVT Entry register 180, and the Len field of the header
information from the AVT input register 170. The Bdry Xing unit
determines if the data of the incoming packet is not aligned on a cache
boundary, and if the coherency bits ("c[1:0]") are set appropriately,
will force the fetch of an address of an interrupt entry that will be
used to point to the special coherency queue for storing the data and
the header of the packet containing that data.



Detailed Description Text - DETX (181): 

Referring for the moment to FIG. 29, there is illustrated a portion 28'
of the memory space implemented by the memory 28 (FIG. 2) of a CPU 12.
As FIG. 29 further illustrates, three cache boundaries CB.sub.a,
CB.sub.b, and CB.sub.c are contained within the memory portion 28',
defining two cache blocks C.sub.-- BLK.sub.a and C.sub.-- BLK.sub.b.
Assume that a write request message packet is received (e.g., from
another CPU 12, or an I/O device), and that the data contained in that
message packet, indicated by the cross-hatching, is to be written to an
area of memory 28 that includes the memory portion 28'. In fact, the
data that will be written will only partially write over the cache block
C.sub.-- BLK.sub.a, but will completely write over the cache block
C.sub.-- BLK.sub.b, and other cache blocks. If the cache 22 of the CPU
12 being written contains the cache block C.sub.-- BLK.sub.b, or any
other cache block other than cache block C.sub.-- BLK.sub.a (or the
cache block containing the other end of the incoming data, if not
aligned on a cache boundary), the block can be marked as "invalid,"
preventing it from being written back into memory and over the newly
received data.
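
The distinction FIG. 29 draws between partially and completely
overwritten cache blocks can be sketched as a simple classification.
The function name, the byte-addressed model, and the example block size
are illustrative assumptions rather than details from the
specification.

```python
def classify_blocks(start, length, block_size):
    """Split the cache blocks touched by a write into (partial, full).

    The write covers addresses [start, start + length). A block is
    "full" if every byte of it is overwritten, otherwise "partial".
    Blocks are identified by their base address.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    end = start + length
    partial, full = [], []
    base = start - start % block_size  # base of the first block touched
    while base < end:
        if start <= base and base + block_size <= end:
            full.append(base)
        else:
            partial.append(base)
        base += block_size
    return partial, full
```

For example, classify_blocks(96, 160, 64) reports block 64 as partially
overwritten and blocks 128 and 192 as completely overwritten. In the
terms of the text, the fully overwritten blocks are the ones that can
simply be marked invalid, while a partially overwritten block is the
case that needs the coherency queue.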

Detailed Description Text - DETX (182): 

However, if the cache 22 contains the cache block C.sub.-- BLK.sub.a,
the boundary crossing logic 219 (if enabled by the "c" being set in the
Permissions field; see FIGS. 11 and 13B) of the AVT 90 (FIG. 11) needs
to detect the I/O packet partially invalidating the cache entry, and
force a coherency interrupt. This results in the fetch of an interrupt
descriptor, containing a pointer to a special interrupt queue, and the
entire incoming TNet request packet will be written to the queue. At the
same time an interrupt will be written to the queued interrupt register
280, to alert the processors 20 that a portion of the incoming data is
located in the special queue.

Detailed Description Text - DETX (183): 

In short, if an incoming packet has data that is to be written to memory
28, the boundary crossing logic 219 checks to see if the boundaries of
the buffer at which the data will be written are aligned with the cache
boundaries. If so, the data will be written as directed. If not, the
packet (both header and data) is written to a special queue, and the
processors are so notified by the intrinsic interrupt process described
above. The processors may then move the data from the special queue to
cache 22, and later write the cache to memory 28 to ensure that good
data is not over-written or otherwise lost, and that coherency between
the cache 22 and the memory 28 is preserved.

Detailed Description Text - DETX (357): 

"Asymmetric variables" are values which are, or may be, different in one
of a pair of CPUs 12 from that of the other. Examples of asymmetric
variables can include a serial number assigned and kept in a
CPU-readable location, for example a register outside memory 28, which
will be different from that of any other CPU, or a content of a register
used to track the occurrence of correctable memory or cache errors
(assuming that detecting, correcting and reporting the error does not
cause the duplexed CPUs to lose lock-step synchronism).

Detailed Description Text - DETX (439): 

The procedure now moves to step 1080 (FIG. 33B) to set up the monitoring
of memory and state (e.g., registers, cache, etc.) that is done while
memory is being copied from the on-line CPU 12A to the off-line CPU 12B.
The step of copying the state of the on-line CPU to the off-line CPU
could be accomplished merely by halting all on-going operation of the
on-line CPU, writing the state of all configuration registers and
control registers (e.g., configuration registers 74 of the interface
units 24), cache, and the like to memory 28 of the on-line CPU, copying
the entire content of the memory 28 to the off-line CPU, and vectoring
both CPUs to a reset routine that will bring them up together. However,
for large systems, this could take tens of seconds or more to
accomplish, an unacceptable amount of time to have the system 10
off-line for reintegration. For that reason, the reintegration process
is performed in a manner that allows the on-line CPU to continue
executing user application code while most of the operation copying
state over to the off-line CPU is done in background.

Detailed Description Text - DETX (455): 

Thus, the reintegration procedure moves to the sequence of steps
illustrated in FIG. 33C, where at step 1100, the on-line CPU 12A
momentarily halts foreground processing, i.e., execution of a user
application. The remaining state (e.g., configuration registers, cache,
etc.) of the on-line processors 20 and their caches is then read and
written to a buffer (series of memory locations) in the memory 28 (step
1102). That state is then copied over to the off-line CPU 12B, together
with a "reset vector" that will direct the processor units 20 of both
CPUs 12A, 12B to a reset instruction.

Detailed Description Text - DETX (456): 

Next, step 1106 will quiesce the routers 14A, 14B by a SLEEP symbol,
followed by a self-addressed message packet to ensure that the FIFOs of
the routers are clear, that the FIFOs of the processor interfaces 24 are
clear, and no further incoming I/O message packets are forthcoming. At
step 1108 the on-line CPU 12A transmits an SRST command symbol to the
routers 14A, 14B which will echo the SRST symbol back to both CPUs 12A,
12B. Since the echoing router is still operating in the slave duplex
mode described above, the SRST echoed to the off-line CPU 12B will still
be the 8 clocks after that echoed to the on-line CPU 12A. The echoed
SRST symbol will be received and acted upon by both CPUs 12A, 12B, to
cause the processor units 20 of each CPU to jump to the location in
memory 28 containing the reset vector and initiate a subroutine that
will restore the stored state of both CPUs 12A, 12B to the processor
units 20, caches 22, registers, etc. The CPUs 12A, 12B will then begin
executing the same instruction stream.

Detailed Description Text - DETX (468): 

Thus, the CPU 12B' comprises only a single processor unit 20' and
associated support components, including the cache 22', interface unit
(IU) 24', memory controller 26', and memory 28'. Thus, while the CPU 12A
is structured in the manner shown in FIG. 2, with cache, processor unit,
interface unit, and memory control redundancies, approximately one-half
of those components are needed to implement CPU 12B'.

Memory Access Protocol," Ser. No. 09/652,834, filed Aug. 31, 2000,
"Special Encoding Of Known Bad Data," Ser. No. 09/652,341, now U.S. Pat.
No. 6,662,319, filed Aug. 31, 2000, "Mechanism To Track All Open Pages
In A DRAM Memory System," Ser. No. 09/652,704, now U.S. Pat. No.
6,662,265, filed Aug. 31, 2000, "Programmable DRAM Address Mapping
Mechanism," Ser. No. 09/653,093, now U.S. Pat. No. 6,546,453, filed Aug.
31, 2000, "Computer Architecture And System For Efficient Management of
Bi-Directional BusMechanism" Ser. No. 09/652,323, filed Aug. 31, 2000,
"An Efficient Address Interleaving With Simultaneous Multiple Locality
Options," Ser. No. 09/652,452, now U.S. Pat. No. 6,567,900, filed Aug.
31, 2000, "A High Performance Way Allocation Strategy For A Multi-Way
Associative Cache System," Ser. No. 09/653,092, filed Aug. 31, 2000,
"Method And System For Absorbing Defects In High Performance
Microprocessor With A Large N-Way Set Associative Cache," Ser. No.
09/651,948, now U.S. Pat. No. 6,671,822, filed Aug. 31, 2000, "A Method
For Reducing Directory Writes And Latency In A High Performance
Directory Based, Coherency Protocol," Ser. No. 09/652,324, now U.S. Pat.
No. 6,654,859, filed Aug. 31, 2000, "Mechanism To Reorder Memory Read
And Write Transactions For Reduced Latency And Increased Bandwidth,"
Ser. No. 09/653,094, now U.S. Pat. No. 6,591,349, filed Aug. 31, 2000,
"System For Minimizing Memory Bank Conflicts in A Computer System," Ser.
No. 09/652,325, now U.S. Pat. No. 6,622,225, filed Aug. 31, 2000,
"Computer Resource Management And Allocation System" Ser. No.
09/651,945, filed Aug. 31, 2000, "Input Data Recovery Scheme," Ser. No.
09/653,643, now U.S. Pat. No. 6,668,335, filed Aug. 31, 2000, "Fast Lane
Prefetching," Ser. No. 09/652,451, now U.S. Pat. No. 6,681,295, filed
Aug. 31, 2000, "A Mechanism For Synchronizing Multiple Skewed
Source-Synchronous Data Channels With Automatic Initialization Feature,"
Ser. No. 09/652,480, now U.S. Pat. No. 6,636,955, filed Aug. 31, 2000,
"A Mechanism To Control The Allocation Of An N-Source Shared Buffer,"
Ser. No. 09/651,924, filed Aug. 31, 2000, and "Chaining Directory Reads
And Writes To Reduce DRAM Bandwidth In A Directory Based CC-NUMA
Protocol," Ser. No. 09/652,315, now U.S. Pat. No. 6,546,465, filed Aug.
31, 2000, all of which are incorporated by reference herein.



_ KWIC 

Brief Summary Text - BSTX (15): 

The problems noted above are solved in large part by a directory-based
multiprocessor cache control system for distributing invalidate messages
to change the state of shared data in a computer system. The plurality
of processors may be grouped into a plurality of clusters. A directory
controller tracks copies of shared data sent to processors in the
clusters. This tracking is accomplished using a share mask data register
that contains at least as many bit locations as there are clusters. When
a block of data from main memory is distributed to a processor, the
directory controller will set a bit in the share mask corresponding to
the cluster in which the sharing processor is located. Upon receiving an
exclusive request from a processor requesting permission to modify a
shared copy of the data, the directory controller generates invalidate
messages requesting that other processors sharing the same data
invalidate that data. These invalidate messages are sent via a
point-to-point transmission only to master processors in clusters
actually containing a shared copy of the data. Upon receiving the
invalidate message, the master processors broadcast the invalidate
message in an ordered fan-in/fan-out process to each processor in the
cluster. The path by which the invalidate messages are broadcast within
a cluster is determined by control and status registers associated with
each processor in the system. These registers include configuration
information which establishes to which processors, if any, a processor
should forward the broadcast invalidate message. All processors within
the cluster invalidate a local copy of the shared data if it exists and
if the processor is not a requestor. The processors then send
acknowledgement messages to the processor from which the invalidate
message was received. Once the master processor receives
acknowledgements from all processors in the cluster, the master
processor sends an invalidate acknowledgment message to the processor
that originally requested the exclusive rights to the shared data. The
cache coherency is scalable and may be implemented using the hybrid
point-to-point/broadcast scheme or a conventional point-to-point only
directory-based invalidate scheme. A PID-SHIFT register holds
configuration information that determines which implementation shall be
used. If the PID-SHIFT register holds the value zero, a conventional
point-to-point invalidate scheme will be used. For other values in the
PID-SHIFT register, the value determines the number of processors
grouped per cluster and establishes that the hybrid invalidate scheme
shall be used.
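
The share mask bookkeeping described above can be sketched as follows.
This is a hedged illustration, not the patented design: it assumes that
a processor's cluster is its processor ID shifted right by the PID-SHIFT
value (so a cluster holds 2**PID-SHIFT processors), and that the master
of a cluster is its lowest-numbered processor. Neither assumption is
stated in the text, and the class and method names are hypothetical.

```python
class Directory:
    """Tracks which clusters hold a shared copy of one block of data."""

    def __init__(self, pid_shift):
        self.pid_shift = pid_shift  # assumed: cluster = pid >> pid_shift
        self.share_mask = 0         # one bit per cluster

    def cluster_of(self, pid):
        return pid >> self.pid_shift

    def record_sharer(self, pid):
        """Set the share-mask bit for the cluster containing pid."""
        self.share_mask |= 1 << self.cluster_of(pid)

    def masters_to_invalidate(self):
        """Return one master processor ID per cluster that holds a copy.

        The master is assumed to be the lowest-numbered processor in
        the cluster; the text does not say how masters are chosen.
        """
        clusters = [c for c in range(self.share_mask.bit_length())
                    if (self.share_mask >> c) & 1]
        return [c << self.pid_shift for c in clusters]
```

Under these assumptions a shift of zero puts every processor in its own
cluster, so the list of masters is simply the list of sharers, which
matches the conventional point-to-point case the text associates with a
PID-SHIFT value of zero.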
(57) ABSTRACT

A computer system (10) includes a node controller (12) operable to
process invalidation requests. The node controller (12) includes a
network interface unit (20), a memory directory interface unit (22), a
processor interface unit (24), an input/output interface unit (26), a
local buffer unit (28), and a crossbar unit (30). A local processor (16)
generates an invalidation request that is processed by the processor
interface unit (24) for placement into the local buffer unit (28). The
invalidation request indicates that particular data within a local
memory (18) associated with the node controller (12) has been altered by
the local processor (16). The local buffer unit (28) generates a
plurality of invalidation messages in response to the invalidation
request, the invalidation messages being destined for remote processors
(16) associated with remote node controllers (12) in the computer system
(10) that share the particular data. The crossbar unit (30) arbitrates
the transfer of the invalidation messages with data, control messages,
and other traffic to and from all units associated with the node
controller (12) so that the node controller (12) is not clogged with the
transfer of invalidation messages.
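
The abstract's two ideas, expanding one invalidation request into many
messages and interleaving those messages with other traffic, can be
sketched as below. The message tuple layout, the function names, and the
strict round-robin policy are assumptions for illustration; the abstract
says only that the crossbar unit arbitrates so that the node controller
is not clogged.

```python
from collections import deque


def expand_request(address, sharers):
    """Generate one invalidation message per sharing node (assumed format)."""
    return [("INVAL", address, node) for node in sharers]


def interleave(invalidations, other_traffic):
    """Alternate between the two sources so neither starves the other.

    Round-robin is a stand-in policy, not the arbiter described in the
    patent.
    """
    queues = [deque(invalidations), deque(other_traffic)]
    out = []
    while any(queues):          # an empty deque is falsy
        for q in queues:
            if q:
                out.append(q.popleft())
    return out
```

With three invalidation messages and two data items queued, the output
alternates between the two sources until the data items run out, so the
other traffic is never held behind the whole invalidation burst.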

20 Claims, 4 Drawing Sheets 




[Front-page drawing. Labels recoverable from the extraction: local node
controller and remote node controller, each with a local buffer unit and
an IO I/F unit; invalidation request and returned invalidation request;
ACKs; to/from I/O device; to network; from network; to/from local
processor.]
3/1/05, EAST version: 2.0.1.4 



U.S. Patent    Jul. 16, 2002    Sheet 1 of 4    US 6,421,712 B1



[FIG. 1: computer system 10. Labels recoverable from the extraction:
processor 16, memory 17, node controller, I/O device 18 (each shown
twice).]

[FIG. 2: node controller 12. Labels recoverable from the extraction:
local buffer unit 28, IO I/F unit, network I/F unit 20, crossbar unit,
processor I/F unit 24, memory/directory I/F unit; to/from I/O device,
to/from network, to/from memory, to/from local processor.]






U.S. Patent    Jul. 16, 2002    Sheet 2 of 4    US 6,421,712 B1




[Sheet 2 drawing; the figure label did not survive extraction. Labels
recoverable: data FIFO; FIFO, Hdr (three instances); MOQ, 192 entries,
3x(4+12 hdr), 3x(0+48 data).]






U.S. Patent    Jul. 16, 2002    Sheet 3 of 4    US 6,421,712 B1



[FIG. 4A: node controller 12. Labels recoverable from the extraction:
local buffer unit 28, IO I/F unit, network I/F unit 20, crossbar unit
30, processor I/F unit 24, memory/directory I/F unit; invalidation
request; invalidation messages; ACKs; to/from network, to/from I/O
device, to/from memory, to/from local processor.]






U.S. Patent    Jul. 16, 2002    Sheet 4 of 4    US 6,421,712 B1



[FIG. 4B: local node controller and remote node controller 12. Labels
recoverable from the extraction: in each controller, a local buffer
unit, IO I/F unit, network I/F unit 20, crossbar unit 30, processor I/F
unit 24, and memory/directory I/F unit; invalidation request; returned
invalidation request; ACKs; to network; from network; to/from I/O
device, to/from memory, to/from local processor.]






US 6,421,712 Bl 

1 2 

METHOD AND APPARATUS FOR therefrom. Another technical advantage is the use of a 

BROADCASTING INVALIDATION dedicated crossbar port to interleave large invalidation mes- 

MESSAGES IN A COMPUTER SYSTEM sage groups with other classes of traffic flowing through the 

node controller. Yet another technical advantage is the 

TECHNICAL FIELD OF THE INVENTION 5 ability to process all classes of traffic despite having a large 

number of invalidation requests to process. Still another 

The present invention relates in general to computer technical advantage is to locally process invalidation 

architecture and more particularly to a method and apparatus fequests despite tfae fact that lhe memory location is 

for broadcasting invalidation messages in a computer sys- assoc i a ted with a remote node controller. Other technical 

tem - io advantages may be readily apparent to those skilled in the art 

BACKGROUND OF THE INVENTION from the Ml °^ fi ^ es ' description, and claims. 

When a memory location is altered in a computer system, BRIEF DESCRIPTION OF THE DRAWINGS 

nodes and processors within that computer system that rely For a more complete understanding of the present inven- 

on and share the contents of that memory location must be 15 ^ and ^ advantages thereof> re f er ence is now made to 

informed that their version of the contents of the memory the foUowing description taken in conjunction with the 

location have been altered and are no longer valid. An accompany i n g drawings, wherein like reference numerals 

invalidation engine in a node of a computer system is used represent par ts, in which: 

to process invalidation requests from a processor associated . .„ t . . ,. c 

•Zt . • . . «. , . , . . „ n FIG. 1 illustrates a block diagram of a computer system; 

with the node that alters the shared memory by issuing an 20 & r J 

invalidation message to each affected node of the computer FIG. 2 illustrates a simplified block diagram of a node 

system indicated in the invalidation request. A single invali- controller in the computer system; 

dation request may require a multitude of invalidation FIG. 3 illustrates a simplified block diagram of a crossbar 

messages to be sent out the network port and across the unit in the node controller; 

interconnect to each affected node depending on the size of 25 FIGS. 4A and 4B illustrate the processing and broadcast- 
the computer system. The amount of effort and time required mg 0 f invalidation messages in the node controller, 
to broadcast a burst of invalidation requests and the invali- 
dation messages associated therewith create several prob- DETAILED DESCRIPTION OF THE 
lems at the node. For example, the flood of invalidation INVENTION 
messages from a node may monopolize its network port and 30 . 

cause other traffic to wait before being transferred or be „ FIG - 1 15 f bl °° k dia ? am ° f a , «> m P u ' ^""V 

potentially discarded altogether. Nodes may block the send- ^mputer system 10 include, a Plurality of node controllers 

■ f * u • i-j*- c 12 interconnected by a network 14. Each node controller 12 

mg of invalidation requests to a busy invalidation engine for ^^^"^u y a A 

. & . , ^ 1# . ,,1,, i .„ processes data and traffic both internally and with other node 

long periods of time, resultmg in the tying up of the node s v „ ... 4 * m i + a 

, , f • *u ri ♦ .«« « controllers 12 within computer system 10 over network 14. 

crossbar resources and preventing the ability to process 35 * ' 

further inputs. Therefore, it is desirable to avoid having
invalidation requests clog up the operation of a computer
system.

SUMMARY OF THE INVENTION

From the foregoing, it may be appreciated that a need has
arisen for an invalidation engine technique that won't tie up
computer system resources. In accordance with the present
invention, a method and apparatus of broadcasting invalidation
messages in a computer system are provided that substantially
eliminate or reduce disadvantages and problems associated with
conventional invalidation engine techniques.

According to an embodiment of the present invention, there is
provided a node controller for broadcasting invalidation
messages in a computer system that includes a memory directory
unit for controlling access to data within a local memory
device. A network interface unit is operable to provide data
and control messages to and receive data and control messages
from other node controllers in the computer system. A local
buffer unit is operable to receive an invalidation request and
generate a plurality of invalidation messages therefrom. A
crossbar unit arbitrates the transfer of data and invalidation
messages for the memory directory unit, the network interface,
and the local buffer unit through an interleaving technique to
prevent blocking of node controller operation during processing
of invalidation requests.

The present invention provides various technical advantages
over conventional invalidation engine techniques. For example,
one technical advantage is the use of a local buffer to queue
invalidation requests and messages generated

Each node controller may communicate with a local processor 16,
a local memory device 17, and a local input/output device 18.

FIG. 2 is a block diagram of node controller 12. Node
controller 12 includes a network interface unit 20, a memory
directory interface unit 22, a processor interface unit 24, an
input/output interface unit 26, a local buffer unit 28, and a
crossbar unit 30. Network interface unit 20 may provide a
communication link to network 14 in order to transfer data,
messages, and other traffic to other node controllers 12 in
computer system 10. Processor interface unit 22 may provide a
communication link with one or more local processors 16. Memory
directory interface unit 22 may provide a communication link
with one or more local memory devices 17. Input/output
interface unit 26 may provide a communication link with one or
more local input/output devices 18. Local buffer unit 28 is
dedicated to processing invalidation requests from local
processor 16 or from a remote processor associated with a
remote node controller 12. Crossbar unit 30 arbitrates the
transfer of data, messages, and other traffic for node
controller 12.

FIG. 3 is a block diagram of crossbar unit 30. Crossbar unit 30
includes a network interface output queue 40, a memory output
queue 42, an input/output input queue 44, an input/output
output queue 46, a local buffer input queue 48, a local buffer
output queue 50, a processor interface output queue 52, a
processor interface input queue 54, an arbiter 56, and a
datapath crossbar 58. Datapath crossbar 58 provides data,
messages, and other traffic to memory directory interface unit
22 and network interface unit 20. Datapath crossbar 58 provides
data, messages, and other traffic to processor interface input
queue 54 and input/output input queue 44.
Datapath crossbar 58 provides invalidation requests to local
buffer input queue 48 for processing by local buffer unit 28.
Datapath crossbar 58 receives invalidation messages from local
buffer output queue 50 as generated by local buffer unit 28.
Datapath crossbar 58 also receives data from memory output
queue 42 and data, messages, and other traffic from
input/output output queue 46. Datapath crossbar 58 also
receives data, control messages, other traffic, and
invalidation requests from processor interface output queue 52
and network interface output queue 40. Arbiter 56 determines
the configuration of datapath crossbar 58 in transferring data,
control messages, other traffic, and invalidation requests
among all queues within crossbar unit 30 and units of node
controller 12.

FIGS. 4A and 4B show how invalidation requests are processed by
node controller 12. In FIG. 4A, local processor 16 accesses a
memory location within local memory 18 through memory directory
interface unit 22 and processor interface unit 24. If local
processor 16 alters the particular data at the accessed memory
location of local memory 18, local processor 16 generates an
invalidation request provided to processor interface unit 26
for transfer to memory directory interface unit 22. Memory
directory interface unit 22 generates a compact form of the
invalidation request that includes an identity list for all of
the remote processors 16 of remote node controllers 12 in
computer system 10 that share the particular data being
altered. The invalidation request is provided to local buffer
input queue 48 through processor interface output queue 52 as
transferred by datapath crossbar 58 in crossbar unit 30. Local
buffer unit 28 processes the invalidation request by generating
an invalidation message for each remote processor 16 indicated
within the invalidation request. The invalidation message
notifies the remote processor 16 that its version of the
particular data is no longer valid. Local buffer unit 28
provides the invalidation messages to local buffer output queue
50 for transfer to network interface unit 22 through datapath
crossbar 58 as determined by arbiter 56. Arbiter 56 interleaves
the invalidation messages with other traffic using any desired
fairness algorithm to ensure that node controller 12 continues
to provide a robust operation capability. Acknowledgment
messages are generated by remote processors 16 upon receiving
and processing its associated invalidation message. The
acknowledgment messages are transferred to the local processor
16 that generated the invalidation request to indicate that the
remote processor is aware that its version of the particular
data is no longer valid.

FIG. 4B shows an alternative processing scheme for invalidation
requests. A local processor 16 may alter a memory location
located at a remote memory 17 associated with a remote node
controller 12. The local processor generates an invalidation
request that is sent to the remote node controller 12 where the
primary storage for the memory location is maintained. The
invalidation request is then processed as discussed above.
However, if the remote node controller 12 does not have the
available resources to process the invalidation request, the
remote node controller 12 returns the invalidation request to
the local processor 16. The local processor 16 then processes
the invalidation request as discussed above. Acknowledgment
messages are sent to the local processor 16 that generated the
invalidation. The local processor 16 generating the
invalidation request may also forward the invalidation request
to another node controller 12 for processing.

Thus, it is apparent that there has been provided, in
accordance with the present invention, a method and apparatus
of broadcasting invalidation messages in a computer system that
satisfies the advantages set forth above. Although the present
invention has been described in detail, it should be understood
that various changes, substitutions, and alterations may be
readily ascertainable by those of skill in the art and may be
made herein without departing from the spirit and scope of the
present invention as defined by the following claims.

What is claimed is:

1. A node controller for broadcasting invalidation messages in
a memory system, comprising:
a memory directory unit operable to control access of data
within a local memory device;
a network interface unit operable to receive data and control
messages from and provide data and control messages to other
node controllers;
a local buffer operable to receive an invalidation request, the
local buffer operable to generate a plurality of invalidation
messages in response to the invalidation request;
a crossbar unit operable to arbitrate a transfer of data and
invalidation messages between the memory directory unit, the
network interface unit, and the local buffer.

2. The node controller of claim 1, further comprising:
a processor interface unit operable to receive data and control
messages from and provide data and control messages to a local
processor.

3. The node controller of claim 2, wherein the data is shared
among a plurality of remote processors of a plurality of remote
node controllers.

4. The node controller of claim 3, wherein the local processor
generates the invalidation request, the invalidation request
indicating that particular data within the local memory is
desired to be altered by the local processor, the local
processor being operable to provide the invalidation request to
the local buffer through the crossbar unit, the local buffer
being operable to generate the plurality of invalidation
messages in response to the invalidation request, each
invalidation message destined for a remote processor that
shares the particular data.

5. The node controller of claim 4, wherein the local processor
receives an acknowledgment from the remote processor sharing
the particular data, the acknowledgment indicating that the
remote processor has received its invalidation message and
providing notice that the remote processor has invalidated the
particular data.

6. The node controller of claim 3, wherein the local processor
generates an invalidation request, the invalidation request
indicating that particular data within a remote memory is
desired to be altered by the local processor, the local
processor being operable to provide the invalidation request to
a local buffer of a remote node controller through the crossbar
unit and the network interface unit.

7. The node controller of claim 6, wherein the local processor
receives the invalidation request returned from the remote node
controller, return receipt of the invalidation request
indicating that the remote node controller cannot process the
invalidation request.

8. The node controller of claim 7, wherein the local processor
providing the invalidation request to the local buffer for
processing.

9. The node controller of claim 7, wherein the remote node
controller provides invalidation request to another node
controller for processing.
10. The node controller of claim 3, wherein the local processor
receives the invalidation request from a remote node
controller, the local processor providing the invalidation
request to the local buffer for processing.

11. The node controller of claim 1, further comprising:
an input/output unit operable to receive data and control
messages from and provide data and control messages to a
peripheral element.

12. A method of broadcasting invalidation messages in a
computer system, comprising:
receiving an invalidation request, the invalidation request
indicating that particular data in memory is being altered;
generating an invalidation message for each remote processor
that shares the particular data in response to the invalidation
request, the invalidation message informing an associated
remote processor that its version of the particular data is no
longer valid;
interleaving a transfer of the invalidation message with other
classes of traffic to prevent clogging a portion of the
computer system with the processing of invalidation requests.

13. The method of claim 12, further comprising:
generating the invalidation request at a processor associated
with the invalidation message generation.

14. The method of claim 12, further comprising:
generating the invalidation request at a processor remote from
the invalidation message generation.

15. The method of claim 14, further comprising:
returning the invalidation message to the processor remote from
the invalidation message generation in response to a
determination that there is no available resources to generate
the invalidation messages.

16. The method of claim 12, further comprising:
receiving an acknowledgment message in response to the transfer
of the invalidation message.

17. The method of claim 14, further comprising:
receiving an acknowledgment message in response to the transfer
of the invalidation message, the acknowledgment message
indicating that a remote processor has received and processed
its invalidation message;
passing the acknowledgment message to the processor remote from
the invalidation message generation that generated the
invalidation request.

18. The method of claim 12, further comprising:
arbitrating the invalidation message with other classes of
traffic to determine an order of transfer.

19. The method of claim 18, wherein the arbitration is
performed using a standard fairness mechanism.

20. The method of claim 11, further comprising:
specifying the remote processors being affected by the
altering of the particular data in a compact form.
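
The method of claims 12 and 20 can be illustrated with a short
sketch. It is a minimal illustration under stated assumptions,
not the claimed implementation: the compact form is assumed to
be a bit vector with one bit per remote processor, the fairness
mechanism is assumed to be plain alternation between two
traffic classes, and the names sharers_from_mask,
generate_invalidations, and interleave are hypothetical.

```python
from collections import deque
from itertools import zip_longest


def sharers_from_mask(mask: int) -> list[int]:
    """Decode a bit vector into the IDs of the processors it names.

    Assumption: bit i set means remote processor i shares the data.
    """
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def generate_invalidations(address: int, sharer_mask: int) -> deque:
    """Produce one invalidation message per sharing processor."""
    return deque(
        ("INVAL", pid, address) for pid in sharers_from_mask(sharer_mask)
    )


def interleave(invalidations, other_traffic) -> list:
    """Alternate invalidations with other traffic so neither class starves.

    Assumption: simple alternation stands in for the fairness algorithm.
    """
    merged = []
    for inval, other in zip_longest(invalidations, other_traffic):
        if inval is not None:
            merged.append(inval)
        if other is not None:
            merged.append(other)
    return merged
```

For example, a request naming processors 1 and 3 yields two
invalidation messages, and interleaving them with three queued
data transfers lets the first data transfer leave right after
the first invalidation instead of waiting for the whole burst.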

***** 
BROADCAST INVALIDATE SCHEME

CROSS-REFERENCE TO RELATED
APPLICATIONS

This application relates to the following commonly
assigned co-pending applications entitled:

"Apparatus And Method For Interfacing A High Speed
Scan-Path With Slow Speed Test Equipment," Ser. No.
09/653,642, filed Aug. 31, 2000, "Priority Rules For Reducing
Network Message Routing Latency," Ser. No. 09/652,322, filed
Aug. 31, 2000, "Scalable Directory Based Cache Coherence
Protocol," Ser. No. 09/652,703, now U.S. Pat. No. 6,633,960,
filed Aug. 31, 2000, "Scalable Efficient I/O Port Protocol,"
Ser. No. 09/652,391, filed Aug. 31, 2000, "Efficient
Translation Lookaside Buffer Miss Processing In Computer
Systems With A Large Range Of Page Sizes," Ser. No. 09/652,552,
filed Aug. 31, 2000, "Fault Containment And Error Recovery
Techniques in A Scalable Multiprocessor," Ser. No. 09/651,949,
now U.S. Pat. No. 6,678,840, filed Aug. 31, 2000, "Speculative
Directory Writes In A Directory Based Cache Coherent
Non-uniform Memory Access Protocol," Ser. No. 09/652,834, filed
Aug. 31, 2000, "Special Encoding Of Known Bad Data," Ser. No.
09/652^41, now U.S. Pat. No. 6,662,319, filed Aug. 31, 2000,
"Mechanism To Track All Open Pages In A DRAM Memory System,"
Ser. No. 09/652,704, now U.S. Pat. No. 6,662,265, filed Aug.
31, 2000, "Programmable DRAM Address Mapping Mechanism," Ser.
No. 09/653,093, now U.S. Pat. No. 6,546,453, filed Aug. 31,
2000, "Computer Architecture And System For Efficient
Management of Bi-Directional BusMechanism" Ser. No. 09/652,323,
filed Aug. 31, 2000, "An Efficient Address Interleaving With
Simultaneous Multiple Locality Options," Ser. No. 09/652,452,
now U.S. Pat. No. 6,567,900, filed Aug. 31, 2000, "A High
Performance Way Allocation Strategy For A Multi-Way Associative
Cache System," Ser. No. 09/653,092, filed Aug. 31, 2000,
"Method And System For Absorbing Defects In High Performance
Microprocessor With A Large N-Way Set Associative Cache," Ser.
No. 09/651,948, now U.S. Pat. No. 6,671,822, filed Aug. 31,
2000, "A Method For Reducing Directory Writes And Latency In A
High Performance Directory Based, Coherency Protocol," Ser. No.
09/652,324, now U.S. Pat. No. 6,654,859, filed Aug. 31, 2000,
"Mechanism To Reorder Memory Read And Write Transactions For
Reduced Latency And Increased Bandwidth," Ser. No. 09/653,094,
now U.S. Pat. No. 6,591,349, filed Aug. 31, 2000, "System For
Minimizing Memory Bank Conflicts in A Computer System," Ser.
No. 09/652,325, now U.S. Pat. No. 6,622,225, filed Aug. 31,
2000, "Computer Resource Management And Allocation System,"
Ser. No. 09/651,945, filed Aug. 31, 2000, "Input Data Recovery
Scheme," Ser. No. 09/653,643, now U.S. Pat. No. 6,668,335,
filed Aug. 31, 2000, "Fast Lane Prefetching," Ser. No.
09/652,451, now U.S. Pat. No. 6,681,295, filed Aug. 31, 2000,
"A Mechanism For Synchronizing Multiple Skewed
Source-Synchronous Data Channels With Automatic Initialization
Feature," Ser. No. 09/652,480, now U.S. Pat. No. 6,636,955,
filed Aug. 31, 2000, "A Mechanism To Control The Allocation Of
An N-Source Shared Buffer," Ser. No. 09/651,924, filed Aug. 31,
2000, and "Chaining Directory Reads And Writes To Reduce DRAM
Bandwidth In A Directory Based CC-NUMA Protocol," Ser. No.
09/652,315, now U.S. Pat. No. 6,546,465, filed Aug. 31, 2000,
all of which are incorporated by reference herein.
STATEMENT REGARDING FEDERALLY
SPONSORED RESEARCH OR DEVELOPMENT 

Not applicable. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention generally relates to a pipelined, 
superscalar microprocessor. More particularly, the invention 
relates to multi-processor memory cache coherency and a 
scheme for delivering cache invalidate requests and receiv- 
ing invalidate acknowledgements in a scalable multi- 
processor environment. 

2. Background of the Invention 

It often is desirable to include multiple processors in a 
single computer system. This is especially true for compu- 
tationally intensive applications and applications that other- 
wise can benefit from having more than one processor 
simultaneously performing various tasks. It is not uncom- 
mon for a multi-processor system to have 2 or 4 or more 
processors working in concert with one another. Typically, 
each processor couples to at least one and perhaps three or 
four other processors. 

Such systems usually require data and commands (e.g., 
read requests, write requests, etc.) to be transmitted from one 
processor to another. Furthermore, the processors may be 
executing tasks and working on identical problems, which
requires that data be shared among the processors. This data 
is commonly stored in a memory location that may be 
adjacent to each processor or may be located in a distinctly 
separate location. In either event, the processor must access 
the data from memory. If the memory is some distance away 
from the processor, delays are incurred as the data request is 
transmitted to a memory controller and the data is transmit- 
ted back to the processor. To alleviate this type of problem, 
a memory cache may be coupled to each processor. The 
memory cache is used to store "local" copies of data that are
"permanently" stored at the master memory location. Since 
the data is local, fetch and retrieve times are reduced thereby 
decreasing execution times. The memory controller may 
distribute copies of that same data to other processors as 
needed. 

Successful implementation of this type of memory struc- 
ture requires a method of keeping track of the copies of data 
that are delivered to the various cache blocks. Furthermore, 
it may be necessary for a processor to alter the data in the 
local cache. In this scenario, the processor must determine if 
the data in question is an exclusive copy of the data. That is, 
the data in the local cache must be the only "copy" of the 
data outside of the main memory location. If the data is 
exclusive, the processor may write to the data block. If the 
data is shared (i.e., one of at least two copies of data outside 
the main memory location), the processor must first request 
and gain exclusive rights to the data before the data can be 
altered. When the memory controller receives an exclusive 
request, various techniques exist for notifying other proces- 
sors that there is an exclusive request pending for that 
particular data block. 
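
The write rule described above can be sketched as a small state
check. This is a minimal illustration rather than the protocol
of any particular system: the names LineState, CacheLine,
write_line, and request_exclusive are hypothetical, and the
callback stands in for whatever exchange with the memory
controller the chosen coherency protocol defines.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class LineState(Enum):
    INVALID = "invalid"
    SHARED = "shared"        # at least one other copy exists outside main memory
    EXCLUSIVE = "exclusive"  # the only copy outside main memory


@dataclass
class CacheLine:
    address: int
    data: int
    state: LineState


def write_line(line: CacheLine, value: int,
               request_exclusive: Callable[[int], bool]) -> bool:
    """Write to a cached block only once the local copy is exclusive."""
    if line.state is LineState.INVALID:
        # Nothing valid to modify; refetching is outside this sketch.
        return False
    if line.state is LineState.SHARED:
        # Hypothetical hook: ask the memory controller for exclusive rights.
        if not request_exclusive(line.address):
            return False
        line.state = LineState.EXCLUSIVE
    line.data = value
    return True
```

A shared line is therefore written only after the request for
exclusive rights succeeds; if the request is refused, the line
keeps both its data and its shared state.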

The particular technique chosen depends on the cache 
coherency protocol implemented for that particular multi- 
processor system. Cache coherency, in part, means that only 
one microprocessor can modify any part of the data at any 
one time, otherwise the state of the system would be 
nondeterministic. Before exclusive rights to the data block 
may be granted to the requestor, any other copies of that data 
block must be invalidated. In one example of a cache 
coherency protocol, the memory controller will broadcast an
invalidate request to each processor in the system, regardless
of whether or not the processors have a copy of the data
block. This approach tends to require less bookkeeping since
the memory controller and processors do not need to keep
track of how many copies of data exist in the memory
structure. However, bandwidth is hindered because processors
must check to see if there is a local copy of the data
block each time the processor receives an invalidate request.

Another conventional cache coherency protocol is a directory
based protocol. In this type of system, the memory
controller keeps a master list, or directory, of the data in
main memory. When copies of the data are distributed to the
individual processors, the memory controller will note the
processor to which the data was sent and the status of that
data. When an exclusive ownership request comes from a
processor, the memory controller sends the invalidate
requests only to the processors that have copies of the same
block of data. Contrary to the broadcast coherency method
described above, bandwidth is conserved by limiting invalidate
traffic to those processors which have a copy of a data
block in the local cache. The performance benefits that result
from a directory based coherence protocol come at the
expense of more overhead in terms of storage and memory
required to store and update the directory. For instance, a
share mask may be needed to successfully keep track of
those processors which have a copy of a data block. A share
mask may be a data register with as many bit locations as
there are processors in the system. When a copy of data is
delivered to a processor, the memory (or directory) controller
may set a bit in a location within the register corresponding
to that processor. Thus, when an invalidate request needs
to be sent, the controller will send the request only to those
processors corresponding to the bits that are set in the share
mask. With design forethought and resource allocation, a
directory based cache coherency may be implemented in
multi-processor systems of varying size.

A problem arises however, when systems are scaled to the
point where there are more processors than that for which
the directory structure can account. For example, a share
mask may include twenty bit locations in the data register,
but a system may be designed with thirty-two microprocessors.
In this example it would be difficult if not impossible,
to keep track of the shared data blocks in all of the processor
memory caches. Similarly, system designers may consciously
desire to keep the directory structure overhead at a
certain size while increasing the processor capability of the
system. The limited nature of this shared directory structure
should not limit the size of the multi-processor system.

It is desirable therefore, to develop a scalable, directory-
based cache coherency that may be used in multi-processor
systems of varying sizes. The cache coherency distributes
invalidate messages much like a conventional directory
based coherency for small systems and operates using a
hybrid directory and broadcast based invalidation scheme
for larger systems. The invention may advantageously provide
system designers flexibility in implementing the cache
coherency. The cache coherency scheme may also advantageously
reduce system cost by allowing a standard coherency
platform to be delivered with product lines of varying
size.

BRIEF SUMMARY OF THE INVENTION

The problems noted above are solved in large part by a
directory-based multiprocessor cache control system for
distributing invalidate messages to change the state of
shared data in a computer system. The plurality of processors
may be grouped into a plurality of clusters. A directory
controller tracks copies of shared data sent to processors in
the clusters. This tracking is accomplished using a share
mask data register that contains at least as many bit locations
as there are clusters. When a block of data from main
memory is distributed to a processor, the directory controller
will set a bit in the share mask corresponding to the cluster
in which the sharing processor is located. Upon receiving an
exclusive request from a processor requesting permission to
modify a shared copy of the data, the directory controller
generates invalidate messages requesting that other processors
sharing the same data invalidate that data. These
invalidate messages are sent via a point-to-point transmission
only to master processors in clusters actually containing
a shared copy of the data. Upon receiving the invalidate
message, the master processors broadcast the invalidate
message in an ordered fan-in/fan-out process to each processor
in the cluster. The path by which the invalidate
messages are broadcast within a cluster is determined by
control and status registers associated with each processor in
the system. These registers include configuration information
which establishes to which processors, if any, a processor
should forward the broadcast invalidate message. All
processors within the cluster invalidate a local copy of the
shared data if it exists and if the processor is not a requestor.
The processors then send acknowledgement messages to the
processor from which the invalidate message was received.
Once the master processor receives acknowledgements from
all processors in the cluster, the master processor sends an
invalidate acknowledgment message to the processor that
originally requested the exclusive rights to the shared data.

The cache coherency is scalable and may be implemented
using the hybrid point-to-point/broadcast scheme or a
conventional point-to-point only directory-based invalidate
scheme. A PID-SHIFT register holds configuration information
that determines which implementation shall be used. If
the PID-SHIFT register holds the value zero, a conventional
point-to-point invalidate scheme will be used. For other
values in the PID-SHIFT register, the value determines the
number of processors grouped per cluster and establishes
that the hybrid invalidate scheme shall be used.

BRIEF DESCRIPTION OF THE DRAWINGS

For a detailed description of the preferred embodiments of
the invention, reference will now be made to the accompanying
drawings in which:

FIG. 1 shows a system diagram of a plurality of
microprocessors coupled together;

FIGS. 2a and 2b show a block diagram of the microprocessors
of FIG. 1;

FIG. 3 shows a system diagram of a plurality of
microprocessors grouped together in clusters;

FIG. 4 shows a broadcast invalidate distribution scheme
for a cluster of microprocessors; and

FIG. 5 shows a broadcast invalidate distribution scheme
for a cluster of microprocessors where the invalidate request
node, directory node, and a sharing node exist in the same
cluster.

NOTATION AND NOMENCLATURE

Certain terms are used throughout the following description
and claims to refer to particular system components. As
one skilled in the art will appreciate, computer companies
may refer to a component by different names. This document
does not intend to distinguish between components that
differ in name but not function. In the following discussion
and in the claims, the terms "including" and "comprising"
are used in an open-ended fashion, and thus should be
interpreted to mean "including, but not limited to . . . ". Also,
the term "couple" or "couples" is intended to mean either an
indirect or direct electrical connection. Thus, if a first device
couples to a second device, that connection may be through
a direct electrical connection, or through an indirect
electrical connection via other devices and connections.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

Referring now to FIG. 1, in accordance with the preferred
embodiment of the invention, computer system 90 comprises
one or more processors 100 coupled to a memory 102
and an input/output ("I/O") controller 104. As shown,
computer system 90 includes twelve processors 100, each
processor coupled to a memory and an I/O controller. Each
processor preferably includes four ports for connection to
adjacent processors. The interprocessor ports are designated
"North," "South," "East," and "West" in accordance with the
well-known Manhattan grid architecture also known as a
crossbar interconnection network architecture. As such, each
processor 100 can be connected to four other processors.
The processors on both ends of the system layout wrap
around and connect to processors on the opposite side to
implement a 2D torus-type connection. Although twelve
processors 100 are shown in the exemplary embodiment of
FIG. 1, any desired number of processors (e.g., 256) can be
included. For purposes of the following discussion, the
processor in the upper, left-hand corner of FIG. 1 will be
discussed with the understanding that the other processors
100 are similarly configured in the preferred embodiment.

As noted, each processor preferably has an associated I/O
controller 104. The I/O controller 104 provides an interface
to various input/output devices such as disk drives 105 and
106, as shown in the lower, left-hand corner of FIG. 1. Data
from the I/O devices thus enters the 2D torus via the I/O
controllers.

Each processor also, preferably, has an associated
memory 102. In accordance with the preferred embodiment,
the memory 102 preferably comprises RAMbus™ memory
devices, but other types of memory devices can be used, if
desired. The capacity of the memory devices 102 can be any
suitable size. Further, memory devices 102 preferably are
implemented as Rambus Interface Memory Modules
("RIMM").

In general, computer system 90 can be configured so that
any processor 100 can access its own memory 102 and I/O
devices, as well as the memory and I/O devices of all other
processors in the system. Preferably, the computer system
may have physical connections between each processor
resulting in low interprocessor communication times and
improved memory and I/O device access reliability. If
physical connections are not present between each pair of
processors, a pass-through or bypass path is preferably
implemented in each processor that permits accesses to a
processor's memory and I/O devices by another processor
through one or more pass-through processors.

Referring still to FIG. 1, a conventional directory-based,
share invalidate scheme may be implemented in the
multiprocessor system shown. In FIG. 1, memory is distributed
about all the processors 100 in the multiprocessor system.
Thus, each processor includes a memory manager and
directory structure for the local memory. Consider for
example, a block of data that resides in the main memory
coupled to processor 107. In this context, processor 107 may
be considered the owner of this particular block of data.
Furthermore, with regards to this particular block of data,
processor 107 may also be considered the directory processor
or directory node. Assume also that shared copies of the
data block reside in the cache memory for processors 100
and 108. If processor 100 needs to modify the shared block
of data, processor 100 will transmit a request for exclusive
ownership of the data to the data owner, processor 107. The
memory manager for processor 107 may preferably have
control of a share mask which forms part of a coherence
directory which is stored with each block of data. The share
mask is comprised of a data register with at least 12 bits (one
for each processor in FIG. 1). In the preferred embodiment,
the share mask is a 20 bit data register. In the example
system shown in FIG. 1, only 12 of the 20 bit locations
would be used and two of the 12 bits (corresponding to
processors 100 and 108) are set indicating that processors
100 and 108 share the block of data. Processor 107 preferably
sends a response back to processor 100 indicating the
number of shared copies of the data in existence. In this
example, there is only one other shared copy of the data
block. Upon receiving the response from the directory node,
processor 100 may preferably change the state of the shared
data block to exclusive. However, processor 100 must wait
to receive one acknowledgment before it can modify the
data block. In an alternative embodiment, processor 100
may wait until it receives all outstanding acknowledgments
before it changes the state of the shared data block to
exclusive.

Processor 107 also preferably transmits a ShareInval
request to processor 108. The ShareInval message is a
command to change the status of a shared data block to
invalid. In response to the ShareInval request, processor 108
will change the state of the shared data block from shared to
invalid and preferably transmit an invalidate
acknowledgment, InvalAck, to the original exclusive
requester, processor 100. Upon receiving the one expected
InvalAck signal, processor 100 may then write to the
exclusive copy of the data block. In general, the requesting
processor 100 must wait for all acknowledgements, the
number of which is indicated by the directory controller 107,
before modifying the exclusive data block.

Referring now to FIGS. 2A and 2B, each processor 100
preferably includes an instruction cache 110, an instruction
fetch, issue and retire unit ("Ibox") 120, an integer execution
unit ("Ebox") 130, a floating-point execution unit ("Fbox")
140, a memory reference unit ("Mbox") 150, a data cache
160, an L2 instruction and data cache control unit ("Cbox")
170, a level L2 cache 180, two memory controllers
("Zbox0" and "Zbox1") 190, and an interprocessor and I/O
router unit ("Rbox") 200. The following discussion
describes each of these units.

Each of the various functional units 110-200 contains
control logic that communicates with the control logic of
various other functional units, control logic as shown. The
instruction cache control logic 110 communicates with the
Ibox 120, Cbox 170, and L2 Cache 180. In addition to the
control logic communicating with the instruction cache 110,
the Ibox control logic 120 communicates with Ebox 130,
Fbox 140 and Cbox 170. The Ebox 130 and Fbox 140
control logic both communicate with the Mbox 150, which
in turn communicates with the data cache 160 and Cbox 170.
The Cbox control logic also communicates with the L2
cache 180, Zboxes 190, and Rbox 200.

Referring still to FIGS. 2a and 2b, the Ibox 120 preferably
includes a fetch unit 121 which contains a virtual program
counter ("VPC") 122, a branch predictor 123, an instruction-stream
translation buffer 124, an instruction predecoder 125, a retire unit
126, decode and rename registers 127, an integer instruction queue
128, and a floating point instruction queue 129. Generally, the VPC
122 maintains virtual addresses for instructions that are in flight.
An instruction is said to be "in-flight" from the time it is fetched
until it retires or aborts. The Ibox 120 can accommodate as many as 80
instructions, in 20 successive fetch slots, in flight between the
decode and rename registers 127 and the end of the pipeline. The VPC
preferably includes a 20-entry table to store these fetched VPC
addresses.

With regard to branch instructions, the Ibox 120 uses the branch
predictor 123. A branch instruction requires program execution either
to continue with the instruction immediately following the branch
instruction if a certain condition is met, or branch to a different
instruction if the particular condition is not met. Accordingly, the
outcome of a branch instruction is not known until the instruction is
executed. In a pipelined architecture, a branch instruction (or any
instruction for that matter) may not be executed for at least several,
and perhaps many, clock cycles after the fetch unit in the processor
fetches the branch instruction. In order to keep the pipeline full,
which is desirable for efficient operation, the processor includes
branch prediction logic that predicts the outcome of a branch
instruction before it is actually executed (also referred to as
"speculating"). The branch predictor 123, which receives addresses
from the VPC queue 122, preferably bases its speculation on short and
long-term history of prior instruction branches. As such, using branch
prediction logic, a processor's fetch unit can speculate the outcome
of a branch instruction before it is actually executed. The
speculation, however, may or may not turn out to be accurate. That is,
the branch predictor logic may guess wrong regarding the direction of
program execution following a branch instruction. If the speculation
proves to have been accurate, which is determined when the processor
executes the branch instruction, then the next instructions to be
executed have already been fetched and are working their way through
the pipeline.

If, however, the branch speculation performed by the branch predictor
123 turns out to have been the wrong prediction (referred to as
"misprediction" or "misspeculation"), many or all of the instructions
behind the branch instruction may have to be flushed from the pipeline
(i.e., not executed) because of the incorrect fork taken after the
branch instruction. Branch predictor 123 uses any suitable branch
prediction algorithm, however, that results in correct speculations
more often than misspeculations, and the overall performance of the
processor is better (even in the face of some misspeculations) than if
speculation was turned off.

The instruction translation buffer ("ITB") 124 couples to the
instruction cache 110 and the fetch unit 121. The ITB 124 comprises a
128-entry, fully associative instruction-stream translation buffer
that is used to store recently used instruction-stream address
translations and page protection information. Preferably, each of the
entries in the ITB 124 may map 1, 8, 64 or 512 contiguous 8-kilobyte
("KB") pages or 1, 32, 512, 8192 contiguous 64-kilobyte pages. The
allocation scheme used for the ITB 124 is a round-robin scheme,
although other schemes can be used as desired.

The predecoder 125 reads an octaword (16 contiguous bytes) from the
instruction cache 110. Each octaword read from instruction cache may
contain up to four naturally aligned instructions per cycle. Branch
prediction and line prediction bits accompany the four instructions
fetched by the predecoder 125. The branch prediction scheme
implemented in branch predictor 123 generally works most efficiently
when only one branch instruction is contained among the four fetched
instructions. The predecoder 125 predicts the instruction cache line
that the branch predictor 123 will generate. The predecoder 125
generates fetch requests for additional instruction cache lines and
stores the instruction stream data in the instruction cache.

Referring still to FIGS. 2a and 2b, the retire unit 126 fetches
instructions in program order, executes them out of order, and then
retires (also called "committing" an instruction) them in order. The
Ibox 120 logic maintains the architectural state of the processor by
retiring an instruction only if all previous instructions have
executed without generating exceptions or branch mispredictions. An
exception is any event that causes suspension of normal instruction
execution. Retiring an instruction commits the processor to any
changes that the instruction may have made to the software accessible
registers and memory. The processor 100 preferably includes the
following three types of machine code accessible hardware: integer and
floating-point registers, memory, and internal processor registers.
The retire unit 126 of the preferred embodiment can retire
instructions at a sustained rate of eight instructions per cycle, and
can retire as many as 11 instructions in a single cycle.

The decode and rename registers 127 contain logic that forwards
instructions to the integer and floating-point instruction queues 128,
129. The decode and rename registers 127 preferably perform the
following two functions. First, the decode and rename registers 127
eliminate register write-after-read ("WAR") and write-after-write
("WAW") data dependencies while preserving true read-after-write
("RAW") data dependencies. This permits instructions to be dynamically
rescheduled. Second, the decode and rename registers 127 permit the
processor to speculatively execute instructions before the control
flow previous to those instructions is resolved.

The logic in the decode and rename registers 127 preferably translates
each instruction's operand register specifiers from the virtual
register numbers in the instruction to the physical register numbers
that hold the corresponding architecturally-correct values. The logic
also renames each instruction destination register specifier from the
virtual number in the instruction to a physical register number chosen
from a list of free physical registers, and updates the register maps.
The decode and rename register logic can process four instructions per
cycle. Preferably, the logic in the decode and rename registers 127
does not return the physical register, which holds the old value of an
instruction's virtual destination register, to the free list until the
instruction has been retired, indicating that the control flow up to
that instruction has been resolved.

If a branch misprediction or exception occurs, the register logic
backs up the contents of the integer and floating-point rename
registers to the state associated with the instruction that triggered
the condition, and the fetch unit 121 restarts at the appropriate
Virtual Program Counter ("VPC"). Preferably, as noted above, twenty
valid fetch slots containing up to eighty instructions can be in
flight between the registers 127 and the end of the processor's
pipeline, where control flow is finally resolved. The register 127
logic is capable of backing up the contents of the registers to the
state associated with any of these 80 instructions in a single cycle.
The register logic 127 preferably places instructions into the integer
or floating-point issue queues 128, 129, from which they are later
issued to functional units 130 or 136 for execution.
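
The renaming performed by the decode and rename registers 127 can be
illustrated with a minimal sketch. It is not the patented logic: the
class name RenameMap, its methods, and the register counts used in the
usage lines are hypothetical and chosen only for illustration. The
sketch shows sources being translated through the current map, each
destination receiving a fresh physical register from a free list, and
the old physical register being returned to the free list only at
retirement.

```python
class RenameMap:
    """Illustrative virtual-to-physical register map (hypothetical names)."""

    def __init__(self, num_virtual, num_physical):
        # Initially virtual register i lives in physical register i.
        self.table = list(range(num_virtual))
        # Remaining physical registers form the free list.
        self.free_list = list(range(num_virtual, num_physical))

    def rename(self, dest, sources):
        """Rename one instruction; returns (new_phys, phys_sources, old_phys)."""
        # Sources read the current mapping, preserving true RAW dependencies.
        phys_sources = [self.table[s] for s in sources]
        # The destination gets a fresh register, removing WAR/WAW hazards.
        old_phys = self.table[dest]
        new_phys = self.free_list.pop(0)
        self.table[dest] = new_phys
        return new_phys, phys_sources, old_phys

    def retire(self, old_phys):
        # The old physical register is freed only once the instruction retires.
        self.free_list.append(old_phys)


rm = RenameMap(num_virtual=4, num_physical=8)
first = rm.rename(dest=1, sources=[2, 3])   # writes virtual register 1
second = rm.rename(dest=1, sources=[1])     # reads, then rewrites register 1
rm.retire(first[2])
```

In this sketch the two writes to virtual register 1 land in different
physical registers, while the second instruction still reads the value
produced by the first.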
The integer instruction queue 128 preferably includes capacity for
twenty integer instructions. The integer instruction queue 128 issues
instructions at a maximum rate of four instructions per cycle. The
specific types of instructions processed through queue 128 include:
integer operate commands, integer conditional branches, unconditional
branches (both displacement and memory formats), integer and
floating-point load and store commands, Privileged Architecture
Library ("PAL") reserved instructions, and integer-to-floating-point
and floating-point-to-integer conversion commands.

Referring still to FIGS. 2a and 2b, the integer execution unit
("Ebox") 130 includes arithmetic logic units ("ALUs") 131, 132, 133,
and 134 and two integer register files 135. Ebox 130 preferably
comprises a 4-path integer execution unit that is implemented as two
functional-unit "clusters" labeled 0 and 1. Each cluster contains a
copy of an 80-entry, physical-register file and two subclusters, named
upper ("U") and lower ("L"). As such, the subclusters 131-134 are
labeled U0, L0, U1, and L1. Bus 137 provides cross-cluster
communication for moving integer result values between the clusters.

The subclusters 131-134 include various components that are not
specifically shown in FIG. 2a. For example, the subclusters preferably
include four 64-bit adders that are used to calculate results for
integer add instructions, logic units, barrel shifters and associated
byte logic, conditional branch logic, a pipelined multiplier for
integer multiply operations, and other components known to those of
ordinary skill in the art.

Each entry in the integer instruction queue 128 preferably asserts
four request signals, one for each of the Ebox 130 subclusters 131,
132, 133, and 134. A queue entry asserts a request when it contains an
instruction that can be executed by the subcluster, if the
instruction's operand register values are available within the
subcluster. The integer instruction queue 128 includes two arbiters:
one for the upper subclusters 132 and 133 and another arbiter for the
lower subclusters 131 and 134. Each arbiter selects two of the
possible twenty requesters for service each cycle. Preferably, the
integer instruction queue 128 arbiters choose between simultaneous
requesters of a subcluster based on the age of the request: older
requests are given priority over newer requests. If a given
instruction requests both lower subclusters, and no older instruction
requests a lower subcluster, then the arbiter preferably assigns
subcluster 131 to the instruction. If a given instruction requests
both upper subclusters, and no older instruction requests an upper
subcluster, then the arbiter preferably assigns subcluster 133 to the
instruction.
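
The age-based arbitration just described can be sketched as follows.
This is only an illustration under stated assumptions, not the actual
arbiter logic: the names QueueEntry and arbitrate are hypothetical,
age is encoded as an integer where a larger value means an older
instruction, and the tie-break used when the preferred subcluster has
already been granted to an older instruction is an assumption that the
text does not spell out.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    age: int              # assumed encoding: larger value = older instruction
    requests: frozenset   # subcluster numerals this entry is requesting


def arbitrate(entries, subclusters, preferred):
    """Grant each subcluster of one arbiter to at most one entry per cycle.

    Returns a dict mapping subcluster numeral -> index of the granted entry.
    """
    grants = {}
    # Visit requesters from oldest to newest so older requests win.
    for idx in sorted(range(len(entries)), key=lambda i: -entries[i].age):
        wanted = [s for s in subclusters
                  if s in entries[idx].requests and s not in grants]
        if not wanted:
            continue
        # An entry that could use either subcluster gets the preferred one
        # when it is still free (assumed fallback: the other subcluster).
        choice = preferred if preferred in wanted else wanted[0]
        grants[choice] = idx
        if len(grants) == len(subclusters):
            break  # each arbiter selects at most two requesters per cycle
    return grants


entries = [
    QueueEntry(age=5, requests=frozenset({131, 134})),
    QueueEntry(age=9, requests=frozenset({134})),
    QueueEntry(age=1, requests=frozenset({131})),
]
lower_grants = arbitrate(entries, subclusters=(131, 134), preferred=131)
lone_grant = arbitrate(entries[:1], subclusters=(131, 134), preferred=131)
```

The same function would serve the upper arbiter with subclusters
(132, 133) and preferred subcluster 133.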

The floating-point instruction queue 129 preferably comprises a
15-entry queue and issues the following types of instructions:
floating-point operates, floating-point conditional branches,
floating-point stores, and floating-point register to integer register
transfers. Each queue entry preferably includes three request lines:
one for the add pipeline, one for the multiply pipeline, and one for
the two store pipelines. The floating-point instruction queue 129
includes three arbiters, one for each of the add, multiply, and store
pipelines. The add and multiply arbiters select one requester per
cycle, while the store pipeline arbiter selects two requesters per
cycle, one for each store pipeline. As with the integer instruction
queue 128 arbiters, the floating-point instruction queue arbiters
select between simultaneous requesters of a pipeline based on the age
of the request: older requests are given priority. Preferably,
floating-point store instructions and floating-point register to
integer register transfer instructions in even numbered queue entries
arbitrate for one store port. Floating-point store instructions and
floating-point register to integer register transfer instructions in
odd numbered queue entries arbitrate for the second store port.

Floating-point store instructions and floating-point
register-to-integer-register transfer instructions are queued in both
the integer and floating-point queues. These instructions wait in the
floating-point queue until their operand register values are available
from the floating-point execution unit ("Fbox") registers. The
processor executing these instructions subsequently requests service
from the store arbiter. Upon being issued from the floating-point
queue 129, the processor executing these instructions signals the
corresponding entry in the integer queue 128 to request service.
Finally, the operation is complete after the instruction is issued
from the integer queue 128.

The integer registers 135, 136 preferably contain storage 
for the processor's integer registers, results written by 
instructions that have not yet been retired, and other infor- 
mation as desired. The two register files 135, 136 preferably 
contain identical values. Each register file preferably 
includes four read ports and six write ports. The four read 
ports are used to source operands to each of the two 
subclusters within a cluster. The six write ports are used to 
write results generated within the cluster or another cluster 
and to write results from load instructions. 

The floating-point execution unit ("Fbox") 140 contains a
floating-point add, divide and square-root calculation unit 142, a
floating-point multiply unit 144 and a register file 146.
Floating-point add, divide and square root operations are handled by
the floating-point add, divide and square root calculation unit 142
while floating-point multiply operations are handled by the multiply
unit 144.

The register file 146 preferably provides storage for 
seventy-two entries including thirty-one floating-point reg- 
isters and forty-one values written by instructions that have 
not yet been retired. The Fbox register file 146 contains six 
read ports and four write ports (not specifically shown). Four 
read ports are used to source operands to the add and 
multiply pipelines, and two read ports are used to source 
data for store instructions. Two write ports are used to write 
results generated by the add and multiply pipelines, and two 
write ports are used to write results from floating-point load 
instructions. 

Referring still to FIG. 2a, the Mbox 150 controls the L1 data cache
160 and ensures architecturally correct behavior for load and store
instructions. The Mbox 150 preferably contains a datastream
translation buffer ("DTB") 151, a load queue ("LQ") 152, a store queue
("SQ") 153, and a miss address file ("MAF") 154. The DTB 151
preferably comprises a fully associative translation buffer that is
used to store data stream address translations and page protection
information. Each of the entries in the DTB 151 can map 1, 8, 64, or
512 contiguous 8-KB pages. The allocation scheme preferably is round
robin, although other suitable schemes could also be used. The DTB 151
also supports an 8-bit Address Space Number ("ASN") and contains an
Address Space Match ("ASM") bit. The ASN is an optionally implemented
register used to reduce the need for invalidation of cached address
translations for process-specific addresses when a context switch
occurs.
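
As a small worked example of the page-group sizes quoted above, the
following sketch computes how much address space a single DTB entry
covers for each group size. The constant and function names are
hypothetical; only the page size and group sizes come from the text.

```python
PAGE_BYTES = 8 * 1024            # 8-KB page, as stated in the text
GROUP_SIZES = (1, 8, 64, 512)    # contiguous pages one entry can map


def entry_coverage_bytes(pages):
    """Bytes of address space covered by one entry mapping `pages` pages."""
    return pages * PAGE_BYTES


# Coverage per entry in kilobytes, keyed by group size.
coverage_kb = {n: entry_coverage_bytes(n) // 1024 for n in GROUP_SIZES}
```

The largest group therefore lets one entry cover 4 MB of contiguous
address space.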

The LQ 152 preferably is a reorder buffer used for load instructions.
It preferably contains thirty-two entries and maintains the state
associated with load instructions that have been issued to the Mbox
150, but for which results have not been delivered to the processor
and the instructions retired. The Mbox 150 assigns load instructions
to LQ slots based on the order in which they were fetched from the
instruction cache 110, and then places them into the LQ 152 after they
are issued by the integer instruction queue 128. The LQ 152 also helps
to ensure correct memory reference behavior for the processor.

The SQ 153 preferably is a reorder buffer and graduation unit for
store instructions. It preferably contains thirty-two entries and
maintains the state associated with store instructions that have been
issued to the Mbox 150, but for which data has not been written to the
data cache 160 and the instruction retired. The Mbox 150 assigns store
instructions to SQ slots based on the order in which they were fetched
from the instruction cache 110 and places them into the SQ 153 after
they are issued by the integer instruction queue 128. The SQ 153 holds
data associated with the store instructions issued from the integer
instruction queue 128 until they are retired, at which point the store
can be allowed to update the data cache 160. The SQ 153 also helps to
ensure correct memory reference behavior for the processor. The miss
address file ("MAF") 154 preferably comprises a 16-entry file that
holds physical addresses associated with pending instruction cache 110
and data cache 160 fill requests and pending input/output ("I/O")
space read transactions.

Processor 100 preferably includes two on-chip primary-level ("L1")
instruction and data caches 110 and 160, and a single secondary-level,
unified instruction/data ("L2") cache 180 (FIG. 2b). The L1
instruction cache 110 preferably comprises a 64-KB virtual-addressed,
two-way set-associative cache. Prediction of future instruction
execution is used to improve the performance of the two-way
set-associative cache without slowing the cache access time. Each
instruction cache block preferably contains a plurality (preferably
16) of instructions, virtual tag bits, an address space number, an
address space match bit, a one-bit PALcode bit to indicate physical
addressing, a valid bit, data and tag parity bits, four access-check
bits, and predecoded information to assist with instruction processing
and fetch control.

The L1 data cache 160 preferably comprises a 64 KB, two-way set
associative, virtually indexed, physically tagged, write-back,
read/write allocate cache with 64-byte cache blocks. During each cycle
the data cache 160 preferably performs one of the following
transactions: two quadword (or shorter) read transactions to arbitrary
addresses, two quadword write transactions to the same aligned
octaword, two non-overlapping less-than quadword writes to the same
aligned quadword, or one sequential read and write transaction from
and to the same aligned octaword. Preferably, each data cache block
contains 64 data bytes and associated quadword ECC bits, physical tag
bits, valid, dirty, shared, and modified bits, tag parity bit
calculated across the tag, dirty, shared, and modified bits, and one
bit to control round-robin set allocation. The data cache 160
preferably is organized to contain two sets, each with 512 rows
containing 64-byte blocks per row (i.e., 32 KB of data per set). The
processor 100 uses two additional bits of virtual address beyond the
bits that specify an 8-KB page in order to specify the data cache row
index. A given virtual address might be found in four unique locations
in the data cache 160, depending on the virtual-to-physical
translation for those two bits. The processor 100 prevents this
aliasing by keeping only one of the four possible translated addresses
in the cache at any time.
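
The cache geometry and the origin of the four possible alias locations
can be checked with a short calculation. The figures below come from
the text; the variable names are hypothetical and the snippet is only
a worked arithmetic sketch.

```python
SETS = 2                 # two-way set associative ("two sets" in the text)
ROWS_PER_SET = 512       # rows in each set
BLOCK_BYTES = 64         # bytes per cache block
PAGE_BYTES = 8 * 1024    # 8-KB page

bytes_per_set = ROWS_PER_SET * BLOCK_BYTES    # data held by one set
total_bytes = SETS * bytes_per_set            # whole data cache

# Address bits needed to select a byte within one set (row index + offset).
index_and_offset_bits = bytes_per_set.bit_length() - 1
# Address bits that are untranslated because they lie within a page.
page_offset_bits = PAGE_BYTES.bit_length() - 1

# Virtual-address bits used for indexing beyond the page offset.
extra_virtual_bits = index_and_offset_bits - page_offset_bits
possible_locations = 2 ** extra_virtual_bits
```

With 15 bits needed to address a set and only 13 bits inside the page
offset, two index bits are subject to translation, which gives the
four candidate locations mentioned above.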

As will be understood by one skilled in the art, the L2 cache 180
comprises a secondary cache for the processor 100, which typically is
implemented on a separate chip. The L2 cache 180 preferably comprises
a 1.75-MB, seven-way set associative write-back mixed instruction and
data cache. Preferably, the L2 cache holds physical address data and
coherence state bits for each block.

Referring now to FIG. 2b, the L2 instruction and data cache control
unit ("Cbox") 170 controls the L2 instruction and data cache 180 and
system ports. As shown, the Cbox 170 contains a fill buffer 171, a
data cache victim buffer 172, a system victim buffer 173, a cache miss
address file ("CMAF") 174, a system victim address file ("SVAF") 175,
a data victim address file ("DVAF") 176, a probe queue ("PRBQ") 177, a
requester miss-address file ("RMAF") 178, a store to I/O space
("STIO") 179, and an arbitration unit 181.

The fill buffer 171 in the Cbox preferably buffers data received from
other functional units outside the Cbox 170. The data and instructions
get written into the fill buffer 171 and other logic units in the Cbox
170 process the data and instructions before sending to another
functional unit or the L1 cache 110 and 160. The data cache victim
buffer ("VDF") 172 preferably stores data flushed from the L1 cache
110 and 160 or sent to the System Victim Data Buffer 173. The System
Victim Data Buffer ("SVDB") 173 sends data flushed from the L2 cache
to other processors in the system and to memory. Cbox Miss-Address
File ("CMAF") 174 preferably holds addresses of L1 cache misses. CMAF
174 updates and maintains the status of these addresses. The System
Victim-Address File ("SVAF") 175 in the Cbox 170 preferably contains
the addresses of all SVDB data entries. Data Victim-Address File
("DVAF") 176 preferably contains the addresses of all data cache
victim buffer ("VDF") 172 data entries.

The Probe Queue ("PRBQ") 177 preferably comprises an 18-entry queue
that holds pending system port cache probe commands and addresses. The
Probe Queue 177 includes 10 remote request entries, 8 forward entries,
and lookup L2 tags and requests from the PRBQ content addressable
memory ("CAM") against the RMAF, CMAF and SVAF. Requestor Miss-Address
File ("RMAF") 178 in the Cbox 170 preferably accepts requests and
responds with data or instructions from the L2 cache. Data accesses
from other functional units in the processor, other processors in the
computer system or any other devices that might need data out of the
L2 cache are sent to the RMAF 178 for service. The Store Input/Output
("STIO") 179 preferably transfers data from the local processor to I/O
cards in the computer system. Finally, arbitration unit 181 in the
Cbox 170 preferably arbitrates between load and store accesses to the
same memory location of the L2 cache and informs other logic blocks in
the Cbox and computer system functional units of the conflict.

Referring still to FIG. 2b, processor 100 preferably includes dual,
integrated RAMbus™ memory controllers 190 (Zbox0 and Zbox1). Each Zbox
190 controls 4 or 5 channels of information flow with the main memory
102 (FIG. 1). Each Zbox 190 preferably includes a front-end directory
in flight table ("DIFT") 191, a middle mapper 192, and a back end 193.
The front-end DIFT 191 performs a number of functions such as managing
the processor's directory-based memory coherency protocol, processing
request commands from the Cbox 170 and Rbox 200, sending forward
commands to the Rbox 200, sending response commands to and receiving
packets from the Cbox 170 and Rbox 200, and tracking up to thirty-two
in-flight transactions. The front-end DIFT 191 also sends directory
read and write requests to the Zbox 190 and conditionally updates
directory information based on request type, Local Probe Response
("LPR") status and directory state.
The middle mapper 192 maps the physical address into RAMbus™ device
format by device, bank, row, and column. The middle mapper 192 also
maintains an open-page table to track all open pages and to close
pages on demand if bank conflicts arise. The mapper 192 also schedules
RAMbus™ transactions such as timer-base request queues. The Zbox back
end 193 preferably packetizes the address, control, and data into
RAMbus™ format and provides the electrical interface to the RAMbus™
devices themselves.

The Rbox 200 provides the interfaces to as many as four other
processors and one I/O controller 104 (FIG. 1). The inter-processor
interfaces are designated as North ("N"), South ("S"), East ("E"), and
West ("W") and provide two-way communication between adjacent
processors. The Rbox 200 also includes configuration and status
registers ("CSR") 195 that govern distribution of broadcast invalidate
messages. The broadcast invalidate distribution is discussed in
further detail below.

In the preferred embodiment, the directory information within the DIFT
191 in the Zbox 190 includes the share mask that is used to track
shared copies of data outside of main memory. The directory also
comprises a configuration register that determines which
implementation of the cache coherency invalidate scheme is currently
in use. This configuration register is preferably called the
ZBOX*_DIFT_CTL[PIDSHIFT] register or simply the PID-SHIFT register. If
the value of the PID-SHIFT register is set to zero, the
multi-processor system will operate with a conventional invalidate
scheme as described above. For other values of this register, n, the
multi-processor system will operate with a hybrid invalidate scheme.
The term "hybrid" is used to indicate that the cache coherency scheme
distributes invalidate requests using both a directory based
point-to-point transmission and a broadcast transmission.

FIG. 3 represents a multi-processor system configured to use the
hybrid invalidate scheme. Whereas the system shown in FIG. 1 comprises
12 processors 100, the system shown in FIG. 3 comprises 12 clumps or
clusters 300. Within each cluster, there are four processors 310. The
number of clusters in a system is determined by the size of the share
mask. As discussed above, the share mask for the multi-processor
system shown in FIG. 1 forms a part of the directory structure for
each cache block and is preferably comprised of a 20-bit data
register. In the example system shown in FIG. 1, 12 of the 20 bits are
used to track share locations of a data block. That same share mask
may be used for the system shown in FIG. 3. In the system shown in
FIG. 3, a bit in the share mask no longer corresponds to a particular
processor or node, but rather to a cluster number. The size of the
share mask therefore determines the number of processors or clusters
that may exist in the multi-processor system.

The number of processors in a clustered system is determined by the
value in the PID-SHIFT register. For a non-zero value, n, in the
PID-SHIFT register, there are 2^n processors in each cluster. For the
system in FIG. 3, the PID-SHIFT register would hold the value two,
which corresponds to four processors per cluster.
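
The relationship between the PID-SHIFT value, the cluster size, and
the share mask can be sketched as below. The text states only that a
value n gives 2^n processors per cluster and that each share-mask bit
then denotes a cluster; deriving the cluster number by shifting a
processor ID right by n bits is an assumption suggested by the
register name, and all function names are hypothetical.

```python
def processors_per_cluster(pid_shift):
    """Cluster size for a given PID-SHIFT value n (2**n, per the text)."""
    return 1 << pid_shift


def cluster_of(processor_id, pid_shift):
    """Assumed mapping: drop the low n bits of the processor ID."""
    return processor_id >> pid_shift


def share_mask(sharer_ids, pid_shift):
    """Build a share mask with one bit per cluster holding a shared copy."""
    mask = 0
    for pid in sharer_ids:
        mask |= 1 << cluster_of(pid, pid_shift)
    return mask


# With PID-SHIFT = 0 each bit tracks one processor (conventional scheme).
conventional = share_mask([0, 5, 6], pid_shift=0)
# With PID-SHIFT = 2 processors 5 and 6 fall in the same four-node cluster.
hybrid = share_mask([0, 5, 6], pid_shift=2)
```

Under this assumed mapping, a 20-bit share mask with PID-SHIFT set to
two could distinguish up to 20 clusters of four processors each.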

FIG. 4 shows an exemplary cluster 420 comprised of 16 processors.
Within each cluster, one of the processors is designated a master 400.
The remaining processors are designated slaves 410. The master 400 is
the central hub through which all invalidate and acknowledgment
messages in a given cluster must travel. If the master or at least one
of the slaves 410 contains a shared data block in a local memory
cache, the bit in the share mask corresponding to that cluster will be
set to indicate that a shared copy of the data block resides within
that cluster. Signal propagation through the cluster is discussed
below.

Referring again to FIG. 3, the hybrid invalidate scheme has some
aspects in common with the conventional directory based scheme
discussed above. Consider a memory block residing in a memory coupled
to processor 320 located within cluster 307. As above, processor 320
is therefore called the owner of the memory block and may preferably
be called the directory node. Consider also that copies of the memory
block have been distributed to processors 310 and 330 located within
clusters 300 and 308, respectively. If processor 310 needs to modify
the shared block of data, processor 310 will transmit a request for
exclusive ownership of the data block to the directory node 320.

Directory node 320 will preferably respond with three separate
messages. The first message is a response to requestor node 310
indicating there are two clusters that share a copy of the data block
and that two InvalAck messages must be received prior to modifying the
data block. Upon receiving this first message, the requester node 310
preferably changes the state of the shared block of memory to
exclusive. The second and third messages are broadcast share
invalidate messages, ShareInvalBroadcast, that are sent to the sharing
clusters. In the present example, the directory node knows that a
shared copy of the data exists in clusters 308 and 300. Requestor node
310 in cluster 300 has one copy of the data, but there may be another
copy of the data in one of the other nodes in cluster 300. Hence, an
invalidate message must be sent to cluster 300 as well as cluster 308.

The SharelnvalBroadcast message differs from the Sha- 
relnval message defined above in that it is sent to the master 
processor in a cluster to indicate the need to broadcast an 
invalidate message to all processors in a cluster. In the 
preferred embodiment, the SharelnvalBroadcast message is 
sent only to clusters within which a shared copy of the data 
block exists. More specifically, this SharelnvalBroadcast 
message is sent only to the master node 340 of those clusters. 
The master processor 340 then distributes a broadcast invali- 
date message to every slave processor 330, 350 in the 
cluster. The slave processors 330, 350 preferably receive the 
broadcast invalidate message and, if a shared copy of the 
data block resides in the local cache and the processor is not 
a requester, the status of that data block is changed to invalid 
and an acknowledgement is sent back to the master node 
340. If the slave processor 350 does not have a shared copy 
of the data block in local cache, no action other than 
responding with an acknowledgement is taken. Once 
acknowledgments from all slave nodes 330, 350 in the 
cluster are received by the master node 340, the master node 
340 sends a single invalidate acknowledgement to the 
requestor node 310. 
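
The slave and master behavior just described can be sketched as a small simulation. This is a hedged illustration rather than the patented implementation: the function name `broadcast_invalidate`, the string cache states, and the use of the reference numerals as processor ids are all assumptions made for the example.

```python
def broadcast_invalidate(caches, master, requestor):
    """Simulate one cluster handling a ShareInvalBroadcast.

    caches maps processor id -> state ("shared", "invalid" or "exclusive").
    Returns the number of slave acknowledgements and the final InvalAck.
    """
    acks = 0
    for pid in caches:
        if pid == master:
            continue
        # A slave invalidates only if it holds a shared copy and is not
        # the requestor; it acknowledges in every case.
        if caches[pid] == "shared" and pid != requestor:
            caches[pid] = "invalid"
        acks += 1
    # The master updates its own copy, if any, then sends a single InvalAck.
    if caches[master] == "shared" and master != requestor:
        caches[master] = "invalid"
    return acks, {"type": "InvalAck", "to": requestor}


# Cluster 308: master 340, sharing slave 330, non-sharing slave 350.
cluster_308 = {340: "invalid", 330: "shared", 350: "invalid"}
acks, inval_ack = broadcast_invalidate(cluster_308, master=340, requestor=310)
```

However many slaves the cluster contains, the requestor sees exactly one InvalAck per cluster.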

In the present example of the preferred embodiment, the 
master node 340 of cluster 308 distributes a broadcast 
invalidate message to all slaves 350 including the sharing 
node 330. The non-sharing slave nodes 350 do not need to 
update the status of any cache blocks in response to this 
invalidate message and simply respond with an acknowl- 
edgment that the broadcast message was received. Sharing 
node 330 will update the status of the shared cache block to 
invalid and respond to its parent node with an acknowledg- 
ment. Upon receiving all slave acknowledgments, master 
processor 340 will change the status of its own shared copy 
of the data block, if it exists. Alternatively, the master
processor may change the status of a shared copy of the
data block when it receives the ShareInvalBroadcast message
from the directory node 320. Master processor 340 will
then send a single InvalAck message to the requestor node
310.

A ShareInvalBroadcast message is also sent to the master
processor 340 located in cluster 300. The invalidate message
is broadcast within this cluster in much the same way it is
broadcast in cluster 308. The main difference in this cluster
300 is that slave processor 310 is also the requestor. Thus,
when processor 310 receives an invalidate request, the
request is ignored and an acknowledgement is sent to its
parent node as if the invalidate command was followed.
When the master node 340 in this cluster 300 receives all
acknowledgments from the slave nodes 310, 350, the master
node 340 will send an InvalAck message to the requestor
node 310. Upon receiving acknowledgements from the
master processors 340 in clusters 300 and 308, the requestor
node 310 may then modify the exclusive data block.

Referring again to FIG. 4, the transmission and distribution
of the broadcast invalidate message will now be discussed.
In order to avoid deadlock within a cluster, the
broadcast message is expected to be fanned out and fanned
in using a specified path. The master node 400 is the root of
the fan-out/fan-in tree. After receiving a ShareInvalBroadcast
message 430, the master processor 400 preferably
buffers the command in an internal structure called an inval
widget. The master node 400 then transmits a
SpecialInvalBroadcast message 440 within the cluster. Like the
ShareInval message defined above, the SpecialInvalBroadcast
message is the command to change the status of a shared
data block to invalid. It differs from the ShareInval message
in that the ShareInval is sent only to sharing nodes in a
point-to-point implementation of the invalidate scheme and
the SpecialInvalBroadcast message is broadcast to all nodes
in a cluster in the hybrid invalidate scheme.

It should be noted that while only one master node 400 is
shown in FIG. 4, any of the nodes in the cluster may act as
master node. Different directory structures in fact may
preferably recognize different masters for any given cluster
to alleviate traffic congestion problems that may arise if only
one master is used per cluster. Propagation of the broadcast
invalidate signals may occur simultaneously because of the
unique propagation settings residing in the CSRs 195.
Determination of which processor in a cluster is the master is
governed by settings in the router lookup table located
within the router unit ("Rbox") 200. The directory node 320
preferably sends an invalidate message that includes the
destination cluster number (corresponding to a bit location
in the share mask) to the Rbox 200. The Rbox, in turn,
converts the cluster number to a master processor
identification and forwards the ShareInvalBroadcast message to
that master processor.

Each node in the cluster is configured to propagate the
SpecialInvalBroadcast message 440 in predetermined
directions based on the direction from which the incoming
message was received. For instance in FIG. 4, slave node
410 receives a SpecialInvalBroadcast message 440 from
master node 400 at the West input port. A control and status
register ("CSR") 195 contains configuration information
that slave node 410 will use to determine to which nodes the
SpecialInvalBroadcast message 440 should be propagated.
This CSR 195 is preferably called the RBOX_W_CFG[BRO]
register. Similar registers exist for the north, south,
and east compass point ports called RBOX_N_CFG[BRO],
RBOX_S_CFG[BRO], and RBOX_E_CFG[BRO], respectively.
In the cluster shown in FIG. 4, the RBOX_W_CFG[BRO]
register indicates that upon receiving the
SpecialInvalBroadcast signal from the West, the
message should be propagated to the north, east and south
output ports. The contents of these registers need not be the
same as each other and may vary from processor to
processor. The RBOX_W_CFG[BRO] register may preferably
comprise a 4 bit data register with each bit corresponding to
one of the four compass direction output ports. Message
propagation is determined by bits that are set within this
register. If none of the bits are set in a given register,
broadcast message propagation is not required. In such a
case, invalidation and acknowledgement are the only actions
required. Each slave will preferably include unique
fan-in/fan-out information stored in the CSR 195. The CSR 195
settings preferably reside in the router unit ("Rbox") 200.

At each node, a new inval widget entry is allocated upon
receipt of the SpecialInvalBroadcast message 440. The
invalidate message is forwarded on to all children as
indicated by settings in the CSR 195. The inval widget entry
waits for all children processors in its subtree to complete
before it completes. The inval widget also waits for the
invalidate on the local processor to complete. Thus, once the
children of a given node complete the invalidate action and
the local invalidate action is complete, the inval widget entry
is deallocated and a broadcast invalidate complete message
is sent to the parent node. This process is repeated for each
node in the cluster until the master node 400 receives a
complete message from each of its children. Once the master
node 400 receives all complete messages and completes its
own invalidate, the master node 400 will deallocate its own
inval widget and transmit a single InvalAck message to the
requesting processor 310.

It should be noted that the fan out scheme depicted in FIG.
4 is only one of many possibilities. Other paths may be
selected based on factors such as optimization or
consistency. The description herein and the claim limitations are
not intended to limit the scope of the intra-cluster broadcast
propagation scheme.

FIG. 5 represents an application of the preferred
embodiment where a requestor node 500, directory node 510 and
sharing node 520 all reside within the same cluster 530.
Upon receipt of an exclusive request from node 500, the
directory node 510 determines from the share mask that at
least one shared copy of the data block is within the local
cluster. In this situation, rather than transmit a
ShareInvalBroadcast message 430 to the master node (as defined by the
router table), the directory node 510 simply assumes the
position of broadcast master. The directory node preferably
transmits a response to the requestor node 500 indicating the
number of sharing clusters in existence. In general, other
clusters may have a shared copy of the data block. If this is
the case, a broadcast invalidate message is sent to the master
processor of the sharing cluster as discussed above. Upon
receiving this first message from the directory node 510, the
requestor node 500 may change the status of its own copy of
the data to exclusive. The directory node 510 also transmits
the SpecialInvalBroadcast message 440 to all slave nodes
in the cluster. This broadcast message fans out and fans in as
discussed above. Once all children in the cluster (except the
requesting node 500) have completed or acknowledged the
invalidate process, the directory node sends an InvalAck
message to the requestor node 500. If there are other sharing
clusters, the requestor node waits for InvalAck messages
from those clusters before the requestor node 500 can
modify the data block.

The above discussion is meant to be illustrative of the
principles and various embodiments of the present
invention. Numerous variations and modifications will become
apparent to those skilled in the art once the above disclosure
is fully appreciated. For example, a different means of
controlling the number of clusters and the number of nodes
or sub-clusters within clusters may be used. It is intended
that the following claims be interpreted to embrace all such
variations and modifications.
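
By way of illustration only, the per-port broadcast configuration registers and the inval widget fan-in described with reference to FIG. 4 can be sketched in Python. Several details are assumptions made for the sketch and are not stated in the disclosure: the bit values assigned to the four compass ports, the `ROOT` key standing in for the master's own fan-out setting, the grid coordinates, and the sequential recursion (real hardware would forward to the children concurrently).

```python
from dataclasses import dataclass, field

# Hypothetical bit assignment for the 4-bit broadcast field. The text says
# only that each bit corresponds to one compass-direction output port.
N, E, S, W = 1, 2, 4, 8
STEP = {N: (0, -1), E: (1, 0), S: (0, 1), W: (-1, 0)}  # (dx, dy), y grows south
OPPOSITE = {N: S, S: N, E: W, W: E}
ROOT = "root"  # hypothetical key for the master's own fan-out setting


@dataclass
class Node:
    # bro_cfg[input_port] is the mask of output ports to forward to.
    bro_cfg: dict = field(default_factory=dict)
    has_shared_copy: bool = False


def broadcast(nodes, pos, in_port, completed):
    """Fan the invalidate out from `pos`, then fan the completions back in."""
    node = nodes[pos]
    mask = node.bro_cfg.get(in_port, 0)
    # Stand-in for the inval widget entry: children that must complete first.
    children = [port for port in (N, E, S, W) if mask & port]
    for out_port in children:
        dx, dy = STEP[out_port]
        child_pos = (pos[0] + dx, pos[1] + dy)
        # The child receives the message on the port facing this node.
        broadcast(nodes, child_pos, OPPOSITE[out_port], completed)
    node.has_shared_copy = False  # local invalidate
    completed.append(pos)         # completion message goes to the parent


# 2x2 cluster: the master at (0, 0) forwards east and south; the node to
# its east forwards south when the message arrives from the west.
nodes = {
    (0, 0): Node(bro_cfg={ROOT: E | S}),
    (1, 0): Node(bro_cfg={W: S}, has_shared_copy=True),
    (0, 1): Node(),
    (1, 1): Node(has_shared_copy=True),
}
completed = []
broadcast(nodes, (0, 0), ROOT, completed)
```

A node whose mask for the incoming port is zero forwards nothing and completes immediately, and the master is always the last entry in `completed`, which is the point at which it would send the single InvalAck.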
What is claimed is: 

1. A method for managing distribution of messages for 
changing the state of shared data in a computer system 
having a main memory, a memory management system, a 
plurality of processors, each processor having an associated
cache, and employing a directory-based cache coherency 
comprising the method of: 

grouping the plurality of processors into a plurality of 
clusters; 

tracking copies of shared data sent to processors in the
clusters; 

receiving an exclusive request from a processor request- 
ing permission to modify a shared copy of the data; 

generating invalidate messages requesting that other
processors sharing the same data invalidate that data;

sending the invalidate messages only to clusters actually 
containing processors that have a shared copy of the 
data in the associated cache; and 

broadcasting the invalidate message to each processor in
the cluster; 

wherein the invalidate message is sent to one master 
processor in a cluster, and the method further com- 
prises; 

the master processor distributing the invalidate message
to one or more slave processors and waiting for an 
acknowledgement from said one or more processors; 

if said one or more slave processors are configured to do 
so, distributing the invalidate message to one or more 
other slave processors, if any exist, and waiting for an
acknowledgement from said other slave processors; 

a slave processor which does not distribute the invalidate 
message to any other processor replying with an 
acknowledgement to the processor from which the 
invalidate message was received; and

upon receiving acknowledgements from all processors to
which the invalidate messages were sent, a slave
processor replying with an acknowledgement to the
processor from which the invalidate message was
received;

wherein upon receiving an invalidate message, the
processor invalidating a local copy of the shared data, if it
exists, and wherein upon receiving acknowledgements
from all slave processors to which the invalidate
messages were sent, the master processor sending an
invalidate acknowledgment message to the processor that
originally requested the exclusive rights to the shared
data.

2. The method of claim 1, wherein:
the slave processors to which the master processor dis- 
tributes the invalidate message are determined by data 
registers associated with the master processor; and 

any other slave processors to which the slave processors 
distribute the invalidate message are determined by
data registers associated with each slave processor; 

wherein data registers exist and may be unique for each 
processor entry port. 

3. The method of claim 1, wherein: 

tracking of the shared copies of the data sent to the
clusters is performed by setting a bit in a data register
with at least as many bit positions as there are clusters;

wherein each cluster is associated with one bit position in
the data register.

4. The method of claim 3, wherein sending the invalidate 
messages only to one master processor in a cluster actually 
containing processors that have a shared copy of the data in 
the associated cache further comprises the steps of: 

selecting only the bit positions containing a set bit; 

cross referencing the bit positions with cluster numbers; 

cross referencing cluster numbers with an actual proces- 
sor identification; and 

delivering the invalidate message to the processor asso- 
ciated with the processor identification. 
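
For illustration only, the lookup chain recited in this claim (set bit, to cluster number, to processor identification) might be sketched as below. The function name `route_invalidates`, the two dictionary tables, and the example processor ids are hypothetical stand-ins for the router lookup table and are not taken from the disclosure.

```python
def route_invalidates(share_mask, bit_to_cluster, cluster_to_master):
    """Map set share-mask bits to the master processor of each sharing cluster.

    Assumes share_mask is a non-negative integer.
    """
    targets = []
    bit = 0
    while share_mask >> bit:
        if (share_mask >> bit) & 1:            # select only the set bits
            cluster = bit_to_cluster[bit]      # bit position -> cluster number
            targets.append(cluster_to_master[cluster])  # cluster -> processor id
        bit += 1
    return targets


# Hypothetical tables for a four-cluster system.
bit_to_cluster = {0: 0, 1: 1, 2: 2, 3: 3}
cluster_to_master = {0: 4, 1: 9, 2: 17, 3: 26}
targets = route_invalidates(0b1001, bit_to_cluster, cluster_to_master)
```

Only the masters of clusters 0 and 3 are returned for this mask, so clusters with a clear bit receive no invalidate traffic.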

5. A method for managing distribution of messages for 
changing the state of shared data in a computer system 
having a main memory, a memory management system, a 
plurality of processors, each processor having an associated 
cache, and employing a directory-based cache coherency 
comprising the method of: 

grouping the plurality of processors into a plurality of 
clusters; 

tracking copies of shared data sent to processors in the 
clusters; 

receiving an exclusive request from a processor request- 
ing permission to modify a shared copy of the data; 

generating invalidate messages requesting that other pro- 
cessors sharing the same data invalidate that data; 

sending the invalidate messages only to clusters actually 
containing processors that have a shared copy of the 
data in the associated cache; 

broadcasting the invalidate message to each processor in 
the cluster; 

distributing the main memory among and coupled to each 
of the plurality of processors and each processor com- 
prising a directory controller for the main memory 
coupled to that processor; 

the directory controller managing the main memory loca- 
tion for the share data and tracking the copies of shared 
data sent to processors in the clusters; 

the processor requesting exclusive ownership of the 
shared data delivering the request to the directory 
controller; and 

the directory controller sending the invalidate messages to 
master processors in clusters actually containing pro- 
cessors that have a shared copy of the data. 

6. A method for managing distribution of messages for 
changing the state of shared data in a computer system 
having a main memory, a memory management system, a 
plurality of processors, each processor having an associated 
cache, and employing a directory-based cache coherency 
comprising the method of: 

grouping the plurality of processors into a plurality of 
clusters; 

tracking copies of shared data sent to processors in the 
clusters; 

receiving an exclusive request from a processor request- 
ing permission to modify a shared copy of the data; 

generating invalidate messages requesting that other pro- 
cessors sharing the same data invalidate that data; 

sending the invalidate messages only to clusters actually 
containing processors that have a shared copy of the 
data in the associated cache; 

broadcasting the invalidate message to each processor in 
the cluster;

upon receiving a request from a processor requesting
permission to modify a shared copy of the data, sending 
a response to the requesting processor indicating the 
number of additional shared copies of the data; 

changing the state of the shared data by the requesting
processor from shared to exclusive; and 

waiting to modify the exclusive data until acknowledg- 
ments arrive from the clusters actually containing pro- 
cessors that have a shared copy of the data in the 
associated cache.

7. The method of claim 6, wherein: 

when the processor requesting exclusive ownership of the 
shared data, the directory controller, and shared copies 
of the data exist within the same cluster, the directory
node assumes the position of master node and broad- 
casts the invalidate message to all the processors in the 
cluster. 

8. A multiprocessor system, comprising: 

a main memory configured to store data;

a plurality of processors, each processor coupled to at
least one memory cache;

a memory directory controller employing directory-based
cache coherence;

at least one input/output device coupled to at least one
processor;

a share mask comprising a data register for tracking 
shared copies of data blocks that are distributed from 
the main memory to one or more cache locations; and 

a PID-SHIFT register which stores configuration settings 
to determine which one of several shared data invali- 
dation schemes shall be implemented; 

wherein when the PID-SHIFT register contains a value of 
zero, the data bits in the share mask data register
correspond to one of the plurality of processors and 
wherein when the PID-SHIFT register contains a non- 
zero value, the data bits in the share mask data register 
correspond to a cluster of processors, each cluster 
comprising more than one of the plurality of processors;

wherein if the value in the PID-SHIFT register is zero, the 
directory controller sets the bit in the share mask 
corresponding to the processor to which a shared copy 
of a data block is distributed and wherein if the value 
in the PID-SHIFT register is nonzero, the directory
controller sets the bit in the share mask corresponding 
to the cluster containing a processor to which a shared 
copy of a data block is distributed; and 

wherein the nonzero value in the PID-SHIFT register 
determines the number of processors in each cluster.
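
As a hedged illustration of how a single register value could select between per-processor and per-cluster tracking, the sketch below treats the PID-SHIFT value as a right-shift count applied to the processor id. That interpretation is an assumption suggested by the register's name; the claim itself states only that a zero value maps bits to processors and that a nonzero value determines the number of processors per cluster. The function names are hypothetical.

```python
def share_mask_bit(pid: int, pid_shift: int) -> int:
    """Return the share-mask bit index for processor `pid`.

    Assumption: the register value is a right-shift count, so each cluster
    holds 2**pid_shift processors with consecutive ids.
    """
    return pid >> pid_shift


def record_sharer(share_mask: int, pid: int, pid_shift: int) -> int:
    """Set the bit for the processor (shift 0) or for its cluster (shift > 0)."""
    return share_mask | (1 << share_mask_bit(pid, pid_shift))


per_processor = record_sharer(0, pid=5, pid_shift=0)  # bit 5 tracks processor 5
per_cluster = record_sharer(0, pid=5, pid_shift=2)    # bit 1 tracks processors 4..7
```

Under this assumption, a shift of 2 groups four processors per cluster, so processors 4 through 7 all map to the same share-mask bit.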

9. A multiprocessor system, comprising:

a main memory configured to store data;

a plurality of processors, each processor coupled to at
least one memory cache;

a memory directory controller employing directory-based
cache coherence;

at least one input/output device coupled to at least one
processor;

a share mask comprising a data register for tracking
shared copies of data blocks that are distributed from 
the main memory to one or more cache locations; and 

a PID-SHIFT register which stores configuration settings
to determine which one of several shared data invalidation
schemes shall be implemented;

wherein when the PID-SHIFT register contains a value of
zero, the data bits in the share mask data register
correspond to one of the plurality of processors and
wherein when the PID-SHIFT register contains a non- 
zero value, the data bits in the share mask data register 
correspond to a cluster of processors, each cluster 
comprising more than one of the plurality of processors;

wherein when more than one shared copy of a data block 
exists outside of the main memory; and 

wherein in response to a request from a requesting pro- 
cessor for exclusive write access to one of the shared 
copies of the data block; and 

wherein when the value in the PID-SHIFT register is zero, 
the directory controller transmits an invalidate message 
only to those processors whose corresponding bits in 
the share mask are set, except the requesting processor; 
and 

wherein when the value in the PID-SHIFT register is 
nonzero, the directory controller transmits an invalidate 
message only to those clusters whose corresponding 
bits in the share mask are set. 

10. The system of claim 9 wherein the cluster further 
comprises: 

a master processor to which the invalidate messages
directed toward the cluster are delivered; and

one or more slave processors, each of which receive an 
invalidate message that is generated by the master 
processor. 

11. The system of claim 10 further comprising: 

a processor router table that includes cross reference 
information which correlates master processor identi- 
fication with cluster numbers. 

12. The system of claim 10 further comprising:

configuration registers associated with each port of a
processor in a cluster which determine the path by
which the invalidate message is broadcast within a
cluster.

13. A multiprocessor system, comprising:

a memory;

multiple computer processor nodes, each with an associated
memory cache; and

a memory controller employing a directory-based cache
coherency employing shared memory invalidation
method, wherein:

the nodes are grouped into clusters;

the memory controller distributes memory blocks from
the memory to the various cache locations at the
request of the associated nodes;

upon receiving a request for exclusive ownership of
one of the shared memory blocks, the memory
controller distributes invalidate messages via direct
point to point transmission to only those clusters
containing nodes that share a block of data in the
associated cache; and

wherein when the invalidate message is received by a
cluster, an invalidate message is broadcast to all
nodes in the cluster;

the system further comprising:

a share mask data register with as many bit locations as
there are clusters;

a router lookup table with cross reference information
correlating bit locations in the share mask to one
master node in each cluster;

wherein the memory controller determines to which
cluster to send the invalidate message according to
bits set in the share mask and sends the invalidate
message to the router which then forwards the invalidate
message to the node whose identification corresponds
to the cluster number as indicated in the
router table.

14. The system of claim 13 each node further comprising:

router control and status registers for each input port of
the node which configure the node's broadcast forwarding
scheme wherein the forwarding scheme determines
to which, if any, nodes the node shall forward a
broadcast invalidate message when a broadcast invalidate
message is received at a given port.

15. The system of claim 14 wherein:

the router control and status registers are comprised of bit
locations corresponding to each output port of the node;
and

wherein if a bit location contains a set bit, the invalidate
message is forwarded to the output port corresponding
to that bit location; and

wherein if a bit location does not contain a set bit, the
invalidate message is not forwarded to the output port
corresponding to that bit location.

16. The system of claim 15 wherein the processors in a 
cluster invalidate shared data, if it exists, and generate and 
forward acknowledgments in reverse direction but along the 
same path followed by the invalidate messages. 

* * * * *